
Spark Action to Analyze table #10288

Open
wants to merge 13 commits into base: main

Conversation

karuppayya
Contributor

This change adds a Spark action to analyze tables.
As part of the analysis, the action generates an Apache DataSketches theta sketch for NDV stats and writes it as a Puffin file.
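
For context, a minimal usage sketch of the proposed action, mirroring the test shown later in this thread (the table and column names are illustrative):

// Hypothetical usage: compute NDV sketches for two columns of the table's
// current snapshot and write them to a Puffin statistics file.
Table table = Spark3Util.loadIcebergTable(spark, "db.table1");
AnalyzeTable.Result result =
    SparkActions.get().analyzeTable(table).columns("id", "data").execute();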

@karuppayya
Contributor Author

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Computes the statistic of the given columns and stores it as Puffin files. */
Member

AnalyzeTableSparkAction is a generic name; I see that in the future we want to compute partition stats too, which may not be written as Puffin files.

Either we can change the naming to computeNDVSketches or make it generic such that any kind of stats can be computed from this.

Member

Thinking more on this, I think we should just call it computeNDVSketches and not mix it with partition stats.

Contributor Author

I tried to follow the model of RDBMSes and engines like Trino, which use ANALYZE TABLE <tblName> to collect all table-level stats.
With a procedure-per-stat model, the user has to invoke a procedure/action for every stat, and
with any new stat addition, the user needs to update their code to call the new procedure/action.

not mix it with partition stats.

I think we could have partition stats as a separate action since it is per partition, whereas this procedure can collect top-level table stats.


@karuppayya
I can see the tests in TestAnalyzeTableAction, it's working fine.
But have we tested in Spark whether it's working with a query like
"Analyze table table1 compute statistics"?

Because generally it gives the error
"[NOT_SUPPORTED_COMMAND_FOR_V2_TABLE] ANALYZE TABLE is not supported for v2 tables."

Contributor Author

Spark does not have the grammar for analyzing tables.
This PR introduces a Spark action. In a subsequent PR, I plan to introduce an Iceberg procedure to invoke the Spark action.

Collaborator

I'll raise a PR for the spec update if there are no objections.

Thanks!

Contributor

@karuppayya Thanks for the great work! Sorry I didn't have time to take a look at your PR earlier. For ANALYZE table, Spark has the following syntax:

ANALYZE TABLE table_identifier [ partition_spec ]
    COMPUTE STATISTICS [ NOSCAN | FOR COLUMNS col [ , ... ] | FOR ALL COLUMNS ]

For column-level stats, Spark computes

  • NDV
  • Max (for numeric, Date, and Timestamp only)
  • Min (for numeric, Date, and Timestamp only)
  • Null Count
  • Avg Len
  • Max Len

We probably want to ensure that the Iceberg implementation aligns with Spark's current functionality.
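
For reference, a small illustration of how Spark computes and exposes these column-level stats in SQL (the table and column names are just placeholders):

// Hypothetical: trigger Spark's column-level stats and inspect the result
// (min, max, num_nulls, distinct_count, avg_col_len, max_col_len).
spark.sql("ANALYZE TABLE db.tbl COMPUTE STATISTICS FOR COLUMNS id, data");
spark.sql("DESCRIBE EXTENDED db.tbl id").show();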


Hi @huaxingao, do you mean that this PR should take care of the other stats too apart from NDV, like MIN, MAX, NULL count, etc.?


Hi @karuppayya, are we planning to include the other column-level stats in this PR?

Contributor Author

@jeesou, this PR is only for the NDV stats. (PR to propagate the stats in scan.)

spark(), table, columnsToBeAnalyzed.toArray(new String[0]));
table
.updateStatistics()
.setStatistics(table.currentSnapshot().snapshotId(), statisticsFile)
Member

@ajantha-bhat ajantha-bhat May 8, 2024

What if the table's current snapshot has been modified concurrently by another client between lines 117 and 120?

Collaborator

this is a good question. we should do #6442


public static Iterator<Tuple2<String, ThetaSketchJavaSerializable>> computeNDVSketches(
SparkSession spark, String tableName, String... columns) {
String sql = String.format("select %s from %s", String.join(",", columns), tableName);
Member

I think we should also think about incremental updates and updating sketches from a previous checkpoint. Querying the whole table may not be efficient.

Contributor Author

Yes, incremental updates need to be wired into the write paths.
This procedure could exist in parallel and get stats for the whole table on demand.

assumeTrue(catalogName.equals("spark_catalog"));
sql(
"CREATE TABLE %s (id int, data string) USING iceberg TBLPROPERTIES"
+ "('format-version'='2')",
Member

The default format version is v2 now, so specifying it again is redundant.

String path = operations.metadataFileLocation(String.format("%s.stats", UUID.randomUUID()));
OutputFile outputFile = fileIO.newOutputFile(path);
try (PuffinWriter writer =
Puffin.write(outputFile).createdBy("Spark DistinctCountProcedure").build()) {
Member

I like this name instead of "analyze table procedure".

@ajantha-bhat
Member

there was an old PR on the same: #6582

@huaxingao
Contributor

there was an old PR on the same: #6582

I don't have time to work on this, so karuppayya will take over. Thanks a lot @karuppayya for continuing the work.

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Thanks @karuppayya @huaxingao @szehon-ho, this is awesome to see! I left a review of the API/implementation; I still have yet to review the tests, which look to be a WIP.

* @param statsToBeCollected set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);
Contributor

Should these stats be a Set<StandardBlobType> instead of arbitrary Strings? I feel like the API becomes more well defined in this case.

Contributor

Oh I see, StandardBlobType defines string constants not enums

Comment on lines 89 to 117
private void validateColumns() {
validateEmptyColumns();
validateTypes();
}

private void validateEmptyColumns() {
if (columnsToBeAnalyzed == null || columnsToBeAnalyzed.isEmpty()) {
throw new ValidationException("No columns to analyze for the table", table.name());
}
}
Contributor

Nit: I think this validation should just happen at the time of setting these on the action rather than at execution time.

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Contributor

I also think this interface should have a snapshot API to allow users to pass in a snapshot to generate the statistics for. If it's not specified then we can generate the statistics for the latest snapshot.

Collaborator

Should we support branch/tag as well? (I guess in subsequent pr)

Collaborator

The snapshot(String snapshotId) method has been added.

For branch/tag -- is it an existing pattern to support this first-class in APIs, or to require the caller to convert the information they have (branch/tag) into a snapshot ID?

Collaborator

Good point, snapshot should be fine.

Comment on lines 104 to 106
if (field == null) {
throw new ValidationException("No column with %s name in the table", columnName);
}
Contributor

Style nit: new line after the if

SparkSession spark, Table table, long snapshotId, String... columnsToBeAnalyzed)
throws IOException {
Iterator<Tuple2<String, ThetaSketchJavaSerializable>> tuple2Iterator =
NDVSketchGenerator.computeNDVSketches(spark, table.name(), snapshotId, columnsToBeAnalyzed);
Contributor

Does computeNDVSketches need to be public? It seems like it can just be package-private. Also, nit: either way, I don't think you need the fully qualified method name.

import org.apache.datasketches.theta.Sketches;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaSketchJavaSerializable implements Serializable {
Contributor

Does this need to be public?

Comment on lines 46 to 53
if (sketch == null) {
return null;
}
if (sketch instanceof UpdateSketch) {
return sketch.compact();
}
Contributor

Style nit: new line after if

null,
ImmutableMap.of()));
}
writer.finish();
Contributor

Nit: Don't think you need the writer.finish() because the try with resources will close, and close will already finish

Collaborator

I think you need to call finish() to get the final fileSize(), etc.

table.currentSnapshot().snapshotId(),
table.currentSnapshot().sequenceNumber(),
ByteBuffer.wrap(sketchMap.get(columns.get(i)).getSketch().toByteArray()),
null,
Contributor

null means that the file will be uncompressed. I think it makes sense not to compress these files by default since the sketch will be a single long per column, so it'll be quite small already and not worth paying the price of compression/decompression.

Collaborator

since the sketch will be a single long per column

The sketch should be more than that -- a small number of KB, IIRC.
Trino uses ZSTD for the blobs, and no compression for the footer.

Comment on lines 157 to 168
if (sketch1.getSketch() == null && sketch2.getSketch() == null) {
return emptySketchWrapped;
}
if (sketch1.getSketch() == null) {
return sketch2;
}
if (sketch2.getSketch() == null) {
return sketch1;
}
Contributor

Style nit: new line after if

@karuppayya karuppayya force-pushed the analyze_action branch 3 times, most recently from 5538f6e to de520fc on June 4, 2024 17:55
Collaborator

@szehon-ho szehon-ho left a comment

Hi @karuppayya thanks for the patch, I left a first round of comments.

* @param columns a set of column names to be analyzed
* @return this for method chaining
*/
AnalyzeTable columns(Set<String> columns);
Collaborator

Nit, how about String... columns (see RewriteDataFiles). same for the others

* @param statsToBeCollected set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);
Collaborator

Let's call it statistics? Like StatisticsFile. https://iceberg.apache.org/contribute/#java-style-guidelines -- I think it can be interpreted differently, but I think point 3 implies we should use the full spelling if possible, and we don't have abbreviations for API methods in most of the code.

Collaborator

Also statsToBeCollected => types ?

AnalyzeTable columns(Set<String> columns);

/**
* A set of statistics to be collected on the given columns of the given table
Collaborator

The set of statistics to be collected? (given columns, given tables is specified elsewhere)

*/
AnalyzeTable snapshot(String snapshotId);

/** The action result that contains a summary of the Analysis. */
Collaborator

Plural? Contains summaries of the analysis?

Also, if capitalized, it can be a Javadoc link.

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Collaborator

Should we support branch/tag as well? (I guess in subsequent pr)

(PairFlatMapFunction<Iterator<Row>, String, String>)
input -> {
final List<Tuple2<String, String>> list = Lists.newArrayList();
while (input.hasNext()) {
Collaborator

Can we use flatmap and mapToPair to make this more concise?

data.javaRDD().flatMap(r -> {
          List<Tuple2<String, String>> list =
            Lists.newArrayListWithExpectedSize(columns.size());
          for (int i = 0; i < r.size(); i++) {
            list.add(new Tuple2<>(columns.get(i), r.get(i).toString()));
          }
          return list.iterator();
          }).mapToPair(t -> t);

return ImmutableAnalyzeTable.Result.builder().analysisResults(analysisResults).build();
}

private boolean analyzeableTypes(Set<String> columns) {
Collaborator

According to intellij, there is a typo (analyzable)

final JavaPairRDD<String, ThetaSketchJavaSerializable> sketches =
pairs.aggregateByKey(
new ThetaSketchJavaSerializable(),
1, // number of partitions
Collaborator

Why limit to 1 ?

Contributor Author

This code was just copied from the DataSketches example.
This value is used in the HashPartitioner behind the scenes.
Should we set it to spark.sql.shuffle.partitions?
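
For illustration, reading the session's configured value could look like this (just a sketch of the idea, not part of the PR):

// Hypothetical: use the session's shuffle parallelism instead of the
// hard-coded 1 when calling aggregateByKey.
int numPartitions =
    Integer.parseInt(spark.conf().get("spark.sql.shuffle.partitions"));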

return sketches.toLocalIterator();
}

static class Add
Collaborator

can we use lambdas here for cleaner code? like

 (sketch, val) -> {
              sketch.update(val);
              return sketch;
          },

The next one may be too complex to inline but maybe we can reduce the ugly java boilerplate

final Row row = input.next();
int size = row.size();
for (int i = 0; i < size; i++) {
list.add(new Tuple2<>(columns.get(i), row.get(i).toString()));
Collaborator

@szehon-ho szehon-ho Jun 6, 2024

Question, does forcing string type affect anything? I see the sketch library takes in other types.

Collaborator

@szehon-ho szehon-ho left a comment

Started a second round


@Override
public AnalyzeTable snapshot(String snapshotIdStr) {
this.snapshotId = Long.parseLong(snapshotIdStr);
Collaborator

I feel we should just make this take long


StatisticsFile statisticsFile =
NDVSketchGenerator.generateNDV(
spark(), table, snapshotId, columns.toArray(new String[0]));
Collaborator

Can we do a similar thing with a columns() method as I suggested with snapshots, that checks if the user list is null/empty and sets it to all the table columns

  private Set<String> columns() {
    return (columns == null) || columns.isEmpty() ?
      table.schema().columns().stream()
        .map(Types.NestedField::name)
        .collect(Collectors.toSet()) : 
      columns;
  }

Then the logic is centralized here if we have more stats, rather than in NDV class.

.type(StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1)
.build();
} catch (IOException ioe) {
List<String> errors = Lists.newArrayList();
Collaborator

Are we only reporting an error if we have an IOException (looks like it's from writing the Puffin file)? It seems a bit strange to catch just this specific case and not other exceptions.

It seems more natural to either catch all exceptions and report an error, or else just throw all exceptions -- what do you think?

private static final Logger LOG = LoggerFactory.getLogger(AnalyzeTableSparkAction.class);

private final Table table;
private Set<String> columns = ImmutableSet.of();
Collaborator

@szehon-ho szehon-ho Jun 13, 2024

I see we convert this from array to set and back a few times (it's passed in as an array, stored as a set, and then passed as an array to the NDVSketchGenerator.generateNDV function). Can we just keep this as an array the whole time (store this as an array here)?

Collaborator

arrays are mutable, so if we decide to switch to arrays, please make sure to defensive-copy whenever passing to another class

return sketches.toLocalIterator();
}

static class Combine
Collaborator

@szehon-ho szehon-ho Jun 13, 2024

I think we can do this to get rid of ugly Function2 definition, and also make the main method a bit cleaner.

    JavaPairRDD<String, ThetaSketchJavaSerializable> sketches =
        pairs.aggregateByKey(
            new ThetaSketchJavaSerializable(),
            Integer.parseInt(
                SQLConf.SHUFFLE_PARTITIONS().defaultValueString()), // number of partitions
            NDVSketchGenerator::update,
            NDVSketchGenerator::combine);

    return sketches.toLocalIterator();
  }

  public static ThetaSketchJavaSerializable update(ThetaSketchJavaSerializable sketch, String val) {
    sketch.update(val);
    return sketch;
  }

  public static ThetaSketchJavaSerializable combine(
        final ThetaSketchJavaSerializable sketch1, final ThetaSketchJavaSerializable sketch2) {
      if (sketch1.getSketch() == null && sketch2.getSketch() == null) {
        return emptySketchWrapped;
      }
      if (sketch1.getSketch() == null) {
        return sketch2;
      }
      if (sketch2.getSketch() == null) {
        return sketch1;
      }

      final CompactSketch compactSketch1 = sketch1.getCompactSketch();
      final CompactSketch compactSketch2 = sketch2.getCompactSketch();
      return new ThetaSketchJavaSerializable(
          new SetOperationBuilder().buildUnion().union(compactSketch1, compactSketch2));
    }

JavaPairRDD<String, ThetaSketchJavaSerializable> sketches =
pairs.aggregateByKey(
new ThetaSketchJavaSerializable(),
Integer.parseInt(
Collaborator

Looks like we may be able to skip passing this, and rely on Spark defaults?

Can you do a bit of research to verify?

Puffin.write(outputFile).createdBy("Iceberg Analyze action").build()) {
for (String columnName : columns) {
writer.add(
new Blob(
Collaborator

@szehon-ho szehon-ho Jun 13, 2024

@findepi @marton-bod would be good if you can take a look as well, to verify interop between Spark and Trino here?

Here it seems we are storing the serialized sketch, as is specified in the spec. Should we store ndv as well in 'metadata', as is specified in spec: https://github.com/apache/iceberg/blob/main/format/puffin-spec.md#apache-datasketches-theta-v1-blob-type (does this mean properties?)
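
For illustration only, a hedged sketch of what attaching the ndv property could look like, reusing the Blob constructor already shown in this PR (the rounding of the estimate is an assumption):

// Hypothetical: populate the "ndv" blob property from the sketch estimate,
// per the Puffin spec for apache-datasketches-theta-v1 blobs.
Sketch sketch = sketchMap.get(columnName).getSketch();
writer.add(
    new Blob(
        StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1,
        ImmutableList.of(table.schema().findField(columnName).fieldId()),
        table.currentSnapshot().snapshotId(),
        table.currentSnapshot().sequenceNumber(),
        ByteBuffer.wrap(sketch.toByteArray()),
        null,
        ImmutableMap.of("ndv", String.valueOf((long) sketch.getEstimate()))));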

Collaborator

Thanks @szehon-ho for the ping.
There are a couple of potential issues:

  • the blob needs to have the ndv property
  • the sketch needs to be updated with the standard byte[] representation of values (Conversions.toByteBuffer); see the sketch after this list
  • there should be an inter-op test: Spark Action to Analyze table #10288 (comment)
  • I am not exactly sure what the lifecycle of ThetaSketchJavaSerializable is and whether this can impact the final results; need to re-read this portion
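
A minimal illustration of the second point, assuming the per-column Iceberg type and an UpdateSketch are at hand (the variable names here are hypothetical):

// Hypothetical: feed the sketch the Iceberg single-value binary representation
// (Conversions.toByteBuffer) instead of Row#toString, per the Puffin spec.
Types.NestedField field = table.schema().findField(columnName);
ByteBuffer valueBuffer = Conversions.toByteBuffer(field.type(), row.get(position));
byte[] valueBytes = new byte[valueBuffer.remaining()];
valueBuffer.duplicate().get(valueBytes);
updateSketch.update(valueBytes);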

* @param types set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable types(Set<String> types);
Collaborator

What are the allowed values for the types parameter? How can someone interacting with the Javadoc learn this?
Is it "stats types", "blob types", or something else?
If "blob types", we could link to https://iceberg.apache.org/puffin-spec/#blob-types, but I don't think we can assume that all known blob types will be supported by the code at all times.

Contributor Author

I think we should use the blob type in the action and the stats type in the procedure (from where we could map each stat to its blob type(s)).
For example, if NDV supports two blob types and the user wants to generate only one of those, that would still be possible from the action.

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Collaborator

The snapshot(String snapshotId) method has been added.

For branch/tag -- is it an existing pattern to support this first-class in APIs, or to require the caller to convert the information they have (branch/tag) into a snapshot ID?

@@ -26,4 +29,8 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

public static Set<String> blobTypes() {
Collaborator

Is it supposed to return "all standard blob types"?
Should the name reflect that?

If we did #8202, would this new blob type be added to this method?

@@ -33,7 +33,7 @@ azuresdk-bom = "1.2.23"
awssdk-s3accessgrants = "2.0.0"
caffeine = "2.9.3"
calcite = "1.10.0"
datasketches = "6.0.0"
datasketches="6.0.0"
Collaborator

please revert

private static final Logger LOG = LoggerFactory.getLogger(AnalyzeTableSparkAction.class);

private final Table table;
private Set<String> columns = ImmutableSet.of();
Collaborator

arrays are mutable, so if we decide to switch to arrays, please make sure to defensive-copy whenever passing to another class

table.currentSnapshot().snapshotId(),
table.currentSnapshot().sequenceNumber(),
ByteBuffer.wrap(sketchMap.get(columns.get(i)).getSketch().toByteArray()),
null,
Collaborator

since the sketch will be a single long per column

The sketch should be more than that -- a small number of KB, IIRC.
Trino uses ZSTD for the blobs, and no compression for the footer.

null,
ImmutableMap.of()));
}
writer.finish();
Collaborator

I think you need to call finish() to get the final fileSize(), etc.

ImmutableList.of(table.schema().findField(columnName).fieldId()),
table.currentSnapshot().snapshotId(),
table.currentSnapshot().sequenceNumber(),
ByteBuffer.wrap(sketchMap.get(columnName).getSketch().toByteArray()),
Collaborator

I think there should be a compact() (perhaps not here, but inside computeNDVSketches).

Collaborator

BTW, it would be good to have a cross-engine compatibility test to ensure the value we write here can indeed be used correctly by other engines. For Trino, you can use https://java.testcontainers.org/modules/databases/trino/

Trino already has such tests, but that doesn't cover Iceberg Spark features that are being implemented.
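
For illustration, a rough sketch of such a test with the Trino testcontainers module (the image tag, catalog wiring, and table name are assumptions):

// Hypothetical cross-engine check: after the Spark action writes the Puffin
// file, query Trino (configured against the same Iceberg catalog) and compare
// its distinct-count estimate with the Spark-computed NDV.
try (TrinoContainer trino = new TrinoContainer("trinodb/trino:449")) {
  trino.start();
  try (Connection connection = trino.createConnection("");
      Statement statement = connection.createStatement();
      ResultSet rows = statement.executeQuery("SHOW STATS FOR iceberg.db.table1")) {
    while (rows.next()) {
      // assert distinct_values_count matches the NDV written by the action
    }
  }
}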

}
final byte[] serializedSketchBytes = new byte[length];
in.readFully(serializedSketchBytes);
sketch = Sketches.wrapSketch(Memory.wrap(serializedSketchBytes));
Collaborator

we wrote a compact sketch, so we can use CompactSketch.wrap here

List<Tuple2<String, String>> columnsList =
Lists.newArrayListWithExpectedSize(columns.size());
for (int i = 0; i < row.size(); i++) {
columnsList.add(new Tuple2<>(columns.get(i), row.get(i).toString()));
Collaborator

this shouldn't use toString
this should use Conversions.toByteBuffer
see https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type

@karuppayya karuppayya force-pushed the analyze_action branch 2 times, most recently from c189b28 to 7b2cbce on June 17, 2024 23:41
Collaborator

@findepi findepi left a comment

A lot changed; skimming for now. Will have to re-read this.

@@ -26,4 +29,8 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

public static Set<String> allStandardBlobTypes() {
Collaborator

Do we still need this method?

Contributor Author

Removed

Comment on lines 58 to 59
private final Set<String> supportedBlobTypes =
ImmutableSet.of(StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1);
Collaborator

Can this be static?

Comment on lines 60 to 61
private Set<String> columns;
private Set<String> blobTypesToAnalyze = supportedBlobTypes;
Collaborator

What about combining these into a list of (blob type, columns) pairs?
This might be necessary when we add support for new blob types.
see https://github.com/apache/iceberg/pull/10288/files#r1639547902

private Long snapshotId;

AnalyzeTableSparkAction(SparkSession spark, Table table) {
super(spark);
this.table = table;
Snapshot snapshot = table.currentSnapshot();
ValidationException.check(snapshot != null, "Cannot analyze a table that has no snapshots");
Collaborator

It would be nice to handle this case gracefully.
A table without snapshots is an empty table (no data).
Also, stats are assigned to snapshots, and since there is no snapshot, there cannot be a stats file created.
Thus there is only one way to handle this gracefully -- just no-op.
I believe this would be better from the user's perspective than just throwing.

.type(statsName)
.addAllErrors(Lists.newArrayList("Stats type not supported"))
.build();
throw new UnsupportedOperationException();
Collaborator

In case this exception is thrown (due to some code modifications in the future), it could be helpful to include type in the exception message.

Comment on lines 171 to 172
public AnalyzeTable snapshot(long snapId) {
this.snapshotId = snapId;
Collaborator

nit: snapId -> snapshotId

Contributor Author

renamed

Sets.newHashSet(StandardBlobTypes.blobTypes()).containsAll(statisticTypes),
"type not supported");
this.types = statisticTypes;
public AnalyzeTable blobTypes(Set<String> types) {
Collaborator

nit: types -> blobTypes (just like in the interface method declaration)

Contributor Author

Renamed

Contributor Author

@karuppayya karuppayya left a comment

Thanks @findepi for your review.
I addressed the comments, please take a look when you get a chance.
One pending item is the interoperability test.

* @param columnNames a set of column names to be analyzed
* @return this for method chaining
*/
AnalyzeTable columns(String... columnNames);
Contributor Author

I can see a few databases that allow collecting all the stats for given columns:
PostgreSQL
ANALYZE table_name (column1, column2);

Oracle
EXEC DBMS_STATS.GATHER_TABLE_STATS('schema_name', 'table_name', 'method_opt' => 'FOR COLUMNS column1, column2');

Also, it looks like most databases don't allow specifying the types of stats to be collected.
Though we take the type as input in the API, we can restrict the usage by not exposing it in the procedure?

correlation blob type for {B,C} together

This is the stats on the combined value of B and C -- is my understanding right?
Is this a common use case? I didn't find databases supporting this by default.

* @param types set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable types(Set<String> types);
Contributor Author

I think we should use the blob type in the action and the stats type in the procedure (from where we could map each stat to its blob type(s)).
For example, if NDV supports two blob types and the user wants to generate only one of those, that would still be possible from the action.

@@ -26,4 +29,8 @@ private StandardBlobTypes() {}
* href="https://datasketches.apache.org/">Apache DataSketches</a> library
*/
public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

public static Set<String> allStandardBlobTypes() {
Contributor Author

Removed

Sets.newHashSet(StandardBlobTypes.blobTypes()).containsAll(statisticTypes),
"type not supported");
this.types = statisticTypes;
public AnalyzeTable blobTypes(Set<String> types) {
Contributor Author

Renamed

Comment on lines 171 to 172
public AnalyzeTable snapshot(long snapId) {
this.snapshotId = snapId;
Contributor Author

renamed

Comment on lines 103 to 106
return type == Type.TypeID.INTEGER
|| type == Type.TypeID.LONG
|| type == Type.TypeID.STRING
|| type == Type.TypeID.DOUBLE;
Contributor Author

These were the data types supported by the sketch lib. I think this is no longer relevant with Conversions.toByteBuffer.

AnalyzeTable blobTypes(Set<String> blobTypes);

/**
* id of the snapshot for which stats need to be collected
Collaborator

Nit: capitalize id (at least first character, or even ID works)

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Collaborator

Good point, snapshot should be fine.

* @param columnNames a set of column names to be analyzed
* @return this for method chaining
*/
AnalyzeTable columns(String... columnNames);
Collaborator

@findepi Is this resolved? Is there a strong use case to support specification of (type, column) pairs?

How about: columnStats(Set<String> types, String... columns)?

If types is not specified, then we can set it to all the supported types.

try {
return generateNDVBlobs().stream();
} catch (Exception e) {
LOG.error(
Collaborator

@szehon-ho szehon-ho Jun 25, 2024

Question: it seems simpler to throw an exception; what is the motivation for this error handling?

Contributor Author

Since there can be more than one type of statistic to be collected, I was thinking about sending the errors with respect to each stat type in the results.
But it looks like this is not something very common, at least in RDBMS land.
I changed the Result to not send error messages and am instead throwing an exception now.

throw new UnsupportedOperationException(
String.format("%s is not supported", type));
}
return Stream.empty();
Collaborator

This can be moved up to the catch code, as it doesn't make too much sense here.

}

private static ThetaSketchJavaSerializable combine(
final ThetaSketchJavaSerializable sketch1, final ThetaSketchJavaSerializable sketch2) {
Collaborator

Nit: remove the finals here

return sketch1;
}

final CompactSketch compactSketch1 = sketch1.getCompactSketch();
Collaborator

Nit: remove final

return sketch;
}

private static ThetaSketchJavaSerializable combine(
Collaborator

I just realized: why can't we move this logic to ThetaSketchJavaSerializable itself?

Then update/combine can just be one-liners, and probably able to be inlined, i.e.:

        colNameAndSketchPair.aggregateByKey(
            new ThetaSketchJavaSerializable(),
            ThetaSketchJavaSerializable::update,
            ThetaSketchJavaSerializable::combine,

Collaborator

@szehon-ho szehon-ho left a comment

Thanks, looks a lot better; some more comments.

* @param columnNames a set of column names to be analyzed
* @return this for method chaining
*/
AnalyzeTable columns(String... columnNames);
Collaborator

Nit: columnNames => columns for simplicity, there is also precedent in RewriteDataFiles for example

super(spark);
this.table = table;
Snapshot snapshot = table.currentSnapshot();
if (snapshot != null) {
Collaborator

I wonder what you think, would it be simpler to do the error check here? (instead of doExecute)

Preconditions.checkNotNull(snapshot, "Cannot analyze an empty table")

Then this.snapshotToAnalyze can just be a primitive long (no need to worry about nulls?) The argument for the setter is already "long" type.

this.table = table;
Snapshot snapshot = table.currentSnapshot();
if (snapshot != null) {
snapshotToAnalyze = snapshot.snapshotId();
Collaborator

Nit: we typically use this when setting member variables, ie this.snapshotToAnalyze


private final Table table;
private Set<String> columns;
private Long snapshotToAnalyze;
Collaborator

@szehon-ho szehon-ho Jun 27, 2024

Nit: snapshotToAnalyze => snapshotId? (I feel 'toAnalyze' is apparent, also as it's not actually a Snapshot object.) If it's to not hide a variable, maybe we can change the other one, as this one is used in more places and it would save more unnecessary chars.

if (snapshot != null) {
snapshotToAnalyze = snapshot.snapshotId();
}
columns =
Collaborator

Same, this.columns = ...

Schema schema = table.schema();
List<Types.NestedField> nestedFields =
columns.stream().map(schema::findField).collect(Collectors.toList());
final JavaPairRDD<String, ByteBuffer> colNameAndSketchPairRDD =
Collaborator

Nit: remove final

return (CompactSketch) sketch;
}

void update(final ByteBuffer value) {
Collaborator

can we remove final?

return sketch.getEstimate();
}

private void writeObject(final ObjectOutputStream out) throws IOException {
Collaborator

Same remove final (and next method too)

.collect(Collectors.toList());
}

static Iterator<Tuple2<String, ThetaSketchJavaSerializable>> computeNDVSketches(
Collaborator

  • any reason this method is not private?
  • how about this method return the Map directly to let the main method be cleaner?

private Result doExecute() {
LOG.info("Starting analysis of {} for snapshot {}", table.name(), snapshotId);
List<Blob> blobs =
supportedBlobTypes.stream()
Collaborator

This looks a bit silly, as we define the supportedTypes (as we decided it's not configurable in the beginning) and are just checking whether it is APACHE_DATASKETCHES_THETA_V1; should we just simplify it?

return columns.stream()
.map(
columnName -> {
Sketch sketch = sketchMap.get(columnName).getSketch();
Collaborator

Actually, I just realized this: why do we need the previous method to pass in a Map of column names? It seems we always use the column id; we can simplify the collect Spark job if so?

Contributor Author

I think it makes sense to generate a map between ids and column names.
As for the previous method passing column names, we need it since we need to compose the dataframe to select columns based on column names:

Dataset<Row> data =
        spark
            .read()
            .option(SparkReadOptions.SNAPSHOT_ID, snapshotId)
            .table(tableName)
            .select(columns.stream().map(functions::col).toArray(Column[]::new));

Table table = Spark3Util.loadIcebergTable(spark, tableName);
SparkActions actions = SparkActions.get();
AnalyzeTable.Result results = actions.analyzeTable(table).columns("id", "data").execute();
actions.analyzeTable(table).columns("id", "data").execute();
Collaborator

option: should we switch this to just one column of the table? (as the other one already calls with two columns, albeit implicitly)

Contributor Author

The second invocation is not needed; it was added by mistake.
Removed now.


Assertions.assertEquals(1, table.statisticsFiles().size());
Assertions.assertEquals(2, table.statisticsFiles().get(0).blobMetadata().size());
assertNotEquals(0, table.statisticsFiles().get(0).fileSizeInBytes());
Collaborator

Nit: it doesn't hurt to also assert that the NDV is collected here?

Snapshot snapshot = table.currentSnapshot();
if (snapshot == null) {
LOG.error("Unable to analyze the table since the table has no snapshots");
throw new RuntimeException("Snapshot id is null");
Collaborator

Nit: update exception message to more reflect error? (like the log message?)


static ThetaSketchJavaSerializable combineSketch(
ThetaSketchJavaSerializable sketch1, ThetaSketchJavaSerializable sketch2) {
ThetaSketchJavaSerializable emptySketchWrapped =
Collaborator

Nit: can we just inline this in the return value? It doesn't seem to be used elsewhere.

Collaborator

@szehon-ho szehon-ho left a comment

Sorry! final few comments on javadoc and small consistency nit

/** An action that collects statistics of an Iceberg table and writes to Puffin files. */
public interface AnalyzeTable extends Action<AnalyzeTable, AnalyzeTable.Result> {
/**
* The set of columns to be analyzed
Collaborator

Choose the set of columns to be analyzed, by default all columns are analyzed.

AnalyzeTable columns(String... columns);

/**
* Id of the snapshot for which stats need to be collected
Collaborator

@szehon-ho szehon-ho Jul 2, 2024

Choose the table snapshot to analyze, by default the current snapshot is analyzed.

/**
* Id of the snapshot for which stats need to be collected
*
* @param snapshotId long id of the snapshot for which stats need to be collected
Collaborator

'to be collected' => 'analyzed' to be consistent with previous javadoc?

*/
AnalyzeTable snapshot(long snapshotId);

/** The action result that contains summaries of the Analysis. */
Collaborator

Analysis can be lowercase, as it's not a class object.

}

private static Map<Integer, ThetaSketchJavaSerializable> computeNDVSketches(
SparkSession spark, Table table, long snapshotId, Set<String> toBeAnalyzedColumns) {
Collaborator

Nit: columnsToBeAnalyzed to be consistent with above method

RuntimeException exception =
assertThrows(
RuntimeException.class, () -> actions.analyzeTable(table).columns("id").execute());
assertTrue(exception.getMessage().contains("Snapshot id is null"));
Collaborator

Probably need to change this message?

@aokolnychyi
Contributor

I'll have some time to take a look this week.

@@ -70,4 +70,10 @@ default RewritePositionDeleteFiles rewritePositionDeletes(Table table) {
throw new UnsupportedOperationException(
this.getClass().getName() + " does not implement rewritePositionDeletes");
}

/** Instantiates an action to analyze tables */
default AnalyzeTable analyzeTable(Table table) {
Contributor

Question: Have we considered other names like computeTableStats or refreshTableStats to be a bit more specific? What naming is used in other engines? I'd be curious to hear from everyone.

Member

+1, I have the same concern about this generic name.

#10288 (comment)

Contributor

I understand there is an ANALYZE command in Spark, but I wonder how to handle partition stats in the future (e.g., whether they should be computed as part of analyze or separately).

Contributor Author

Trino and Spark seem to be collecting partition-level stats as part of the ANALYZE command grammar (or at least they don't have a separate command to collect partition stats).
So I went with the name, but I'm open to changing it.

Member

Maybe by default collect all the stats, and have an option to compute specific stats like distinct count and partition stats by specifying them individually.

@@ -70,4 +70,10 @@ default RewritePositionDeleteFiles rewritePositionDeletes(Table table) {
throw new UnsupportedOperationException(
this.getClass().getName() + " does not implement rewritePositionDeletes");
}

/** Instantiates an action to analyze tables */
Contributor

Minor: Missing . at the end?

interface Result {

/** Returns statistics file. */
StatisticsFile statisticFile();
Contributor

@aokolnychyi aokolnychyi Jul 3, 2024

I think we are missing s here: statisticFile() -> statisticsFile().

AnalyzeTable columns(String... columns);

/**
* Id of the snapshot for which stats need to be collected
Contributor

Minor: id -> ID everywhere.

@@ -59,6 +59,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
implementation project(':iceberg-parquet')
implementation project(':iceberg-arrow')
implementation("org.scala-lang.modules:scala-collection-compat_${scalaVersion}:${libs.versions.scala.collection.compat.get()}")
implementation("org.apache.datasketches:datasketches-java:${libs.versions.datasketches.get()}")
Contributor

Does Spark by any chance ship this? Do we have to worry about conflicts?

Contributor Author

Thanks for catching this. Looks like sql/catalyst uses the same library.
Should we shade this here and pin the version?

Contributor Author

I have shaded the library

super(spark);
this.table = table;
Snapshot snapshot = table.currentSnapshot();
if (snapshot == null) {
Contributor

Hm, shouldn't we simply gracefully return in this case? Why throw an exception?

Contributor Author

@karuppayya karuppayya Jul 3, 2024

Makes sense.
This action currently returns the statistics file as output.
In the case of a table with no snapshots, should we return null for the result?
