- Quick Example
- Secondary Index Table
- Loading Data
- Querying Data
- Compaction
- DDLs on Secondary Index
- Complex DataType support on SI
Start spark-sql in terminal and run the following queries,
CREATE TABLE maintable(a int, b string, c string) stored as carbondata;
insert into maintable select 1, 'ab', 'cd';
CREATE index index1 on table maintable(c) AS 'carbondata';
SELECT a from maintable where c = 'cd';
// NOTE: run explain query and check if query hits the SI table from the plan
EXPLAIN SELECT a from maintable where c = 'cd';
Secondary index tables are created as indexes and managed as child tables internally by Carbondata. Users can create a secondary index based on the column position in the main table(Recommended for right columns) and the queries should have filter on that column to improve the filter query performance.
Data refresh to the secondary index is always automatic. Once SI table is created, Carbondata's
CarbonOptimizer with the help of CarbonSITransformationRule
, transforms the query plan to hit the
SI table based on the filter condition or set of filter conditions present in the query.
So the first level of pruning will be done on the SI table as it stores blocklets and main table/parent
table pruning will be based on the SI output, which helps in giving the faster query results with
better pruning.
Secondary Index table can be created with the below syntax
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE maintable(index_column)
AS
'carbondata'
[PROPERTIES('table_blocksize'='1')]
NOTE: Keywords given inside
[]
is optional.
For instance, main table called sales which is defined as
CREATE TABLE sales (
order_time timestamp,
user_id string,
sex string,
country string,
quantity int,
price bigint)
STORED AS carbondata
User can create SI table using the Create Index DDL
CREATE INDEX index_sales
ON TABLE sales(user_id)
AS
'carbondata'
PROPERTIES('table_blocksize'='1')
NOTE:
- supported properties are table_blocksize, column_meta_cache, cache_level, carbon.column.compressor, sort_scope and global_sort_partitions.
When a user executes a filter query, during the query planning phase, CarbonData with the help of
CarbonSITransformationRule
, checks if there are any index tables present on the filter column of
query. If there are any, then the filter query plan will be transformed in such a way that execution will
first hit the corresponding SI table and give input to the main table for further pruning.
For the main table sales and SI table index_sales created above, following queries
SELECT country, sex from sales where user_id = 'xxx'
SELECT country, sex from sales where user_id = 'xxx' and country = 'INDIA'
will be transformed by CarbonData's CarbonSITransformationRule
to query against SI table
index_sales first which will be input to the main table sales
case1: When the SI table is created and the main table does not have any data. In this case every consecutive load to the main table, will load data to the SI table once the main table data load is finished.
case2: When the SI table is created and the main table already contains some data, then SI creation will also load data to the SI table with the same number of segments as the main table. Thereafter, consecutive load to the main table will also load data to the SI table.
NOTE:
- In case of data load failure to the SI table, segments of main table and SI table will not be in sync. In this scenario the subsequent query pruning will be performed at segment level. It means all successful SI segments will be pruned with SI table and rest of the segment pruning will be performed with help of main table. The subsequent main table load will load the old failed loads along with current load and makes the SI and main table segments in sync.
Direct query can be made on SI tables to check the data present in position reference columns. When a filter query is fired, and if the filter column is a secondary index column, then plan is transformed accordingly to hit the SI table first to make better pruning with the main table and in turn helps for faster query results.
Users can verify whether a query can leverage the SI table or not by executing the EXPLAIN
command, which will show the transformed logical plan, and thus users can check whether the SI table
is selected.
Running Compaction command (ALTER TABLE COMPACT
)[COMPACTION TYPE-> MINOR/MAJOR] on main table will
automatically delete all the old segments of SI and creates a new segment with same name as main
table compacted segment and loads data to it.
Where there are so many small files present in the SI table, then we can use the REFRESH INDEX command to compact the files within an SI segment to avoid many small files.
REFRESH INDEX sales_index ON TABLE sales
This command merges data files in each segment of the SI table.
REFRESH INDEX sales_index ON TABLE sales WHERE SEGMENT.ID IN(1)
This command merges data files within a specified segment of the SI table.
When Secondary indexes are created on a table(s), data fetching happens from secondary indexes created on the main tables for better performance. But sometimes, data fetching from the secondary index might degrade query performance in cases where the data is sparse and most of the blocklets need to be scanned. So to avoid such secondary indexes, we use NI as a function on filters within WHERE clause.
SELECT country, sex from sales where NI(user_id = 'xxx')
The above query ignores column user_id
from the secondary index and fetches data from the main table.
This command is used to get information about all the secondary indexes on a table.
Syntax
SHOW INDEXES ON [TABLE] [db_name.]table_name
This command is used to drop an existing secondary index on a table
Syntax
DROP INDEX [IF EXISTS] index_name ON [TABLE] [db_name.]table_name
This command registers the secondary index with the main table in case of compatibility scenarios where we have old stores.
Syntax
REGISTER INDEX TABLE index_name ON [TABLE] [db_name.]table_name
This command is used to reload segments in the SI table in case when there is some mismatch in the number of segments with main table.
Syntax
Reindex on all the secondary Indexes of the main table
REINDEX ON TABLE [db_name.]main_table_name [WHERE SEGMENT.ID IN(0,1)]
Reindexing at index table level
REINDEX INDEX TABLE index_table ON [db_name.]main_table_name [WHERE SEGMENT.ID IN (1)]
Reindex on Database level
REINDEX DATABASE db_name [WHERE SEGMENT.ID IN (1,2,5)]
Note: This command is not supported with other concurrent operations.
Currently, only complex Array types are supported for creating secondary indexes. Nested Array support and other complex types support will be supported in the future.
Secondary Indexes are used in main table query pruning by rewriting the Spark plan during course of query execution. It is not possible to use Secondary Indexes in the query pruning when the query is fired from engines other than Spark (i.e., Presto, Hive etc). To address this issue, Secondary Index is implemented as a Coarse Grain Index similar to Bloom. Coarse Grain Index pruning happens right after default pruning of the table being queried. A new property is introduced at session level and carbon level to support Secondary Index as a Coarse Grain Index. It can be configured in two variants as show below:
- Configure globally -
carbon.coarse.grain.secondary.index
- Configure for a particular table -
carbon.coarse.grain.secondary.index.<dbname>.<tablename>
Property when specified along with database name and table name ensures that Secondary Index as Coarse Grain Index can be enabled/disable for query pruning on a particular table. It has higher precedence over global configuration.
The default value of the property is false
for the Spark session. By default, Spark
queries continue to use Secondary Indexes in the query pruning via spark query plan rewrite. If
user want to use Secondary Index as a Coarse Grain Index in spark query pruning, need to explicitly
configure the property to true
. Setting the configuration to true
also avoids the
query plan rewrite. Since the feature is newly introduced, unless existing queries are working with
Coarse Grain Secondary Index, and the performance improvement is evident, It is recommended to
avoid using this property for the existing customers using the Secondary Indexes with Spark queries.
The default value of this property is true
for Presto. By default, Presto queries use
Secondary Index as a Coarse Grain Index in spark query pruning. If user do not wish to use the
Secondary Indexes in the query pruning, need to explicitly configure this property to false
.
Note: This feature is not supported with the Secondary Index tables created on the older versions.