Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replicated_deduplication_window=0 setting produces duplicates in DBT #216

Closed
Timev opened this issue Dec 4, 2023 · 1 comment · Fixed by #217
Closed

replicated_deduplication_window=0 setting produces duplicates in DBT #216

Timev opened this issue Dec 4, 2023 · 1 comment · Fixed by #217
Labels
bug Something isn't working

Comments

@Timev
Copy link

Timev commented Dec 4, 2023

Describe the bug

Adding "replicated_deduplication_window=0" setting creates duplicates in DBT entities with a ReplicatedMergeTree engine. The table is not incremental, and I've added allow_automatic_deduplication: "True" to profiles.yml but it doesn't help.
In case I'm launching the same DBT-generated script for table creation manually with SETTINGS replicated_deduplication_window=0 removed - it works fine returning 2 records from table as a result

Steps to reproduce

  1. Create test model "test_dbt_model"
with source_data as (
    select 1 as id
    union all
    select null as id
)
select *
from source_data
  1. Set it's config:
engine: ReplicatedMergeTree()
materialized: table
  1. Run dbt model
    dbt run -s test_dbt_model
  2. The final result shows 4 records inserted, 2 pairs of similar records:
default> select * 
         from dbt.test_dbt_model
         limit 1000
[2023-12-04 22:30:59] 4 rows retrieved starting from 1 in 110 ms (execution: 100 ms, fetching: 10 ms)

Expected behaviour

DBT creates a test_dbt_model table with 2 rows: id = 1 and id is null

Code examples, such as models or profile settings

  1. dbt_project.yml
models:
  batch_etl:
    example:
      +materialized: table
      +engine: ReplicatedMergeTree()
  1. profiles.yml
    prod:
      type: clickhouse
      schema: dbt
      host: "{{ env_var('WH_DEV_HOST') }}"
      port: 8123
      user: "{{ env_var('WH_USER') }}"
      password: "{{ env_var('WH_PW') }}"
      cluster: "{{ env_var('CH_CLUSTER_NAME') }}"
      check_exchange: False
      secure: False
      allow_automatic_deduplication: True

dbt and/or ClickHouse server logs

Generated SQL on insert:

Configuration

Environment

  • dbt version: 1.6.9
  • dbt-clickhouse version: 1.6.0
  • clickhouse-driver version (if using native)
  • clickhouse-connect version (if using http):
  • Python version: 3.8.10
  • Operating system: Windows 10

ClickHouse server

  • ClickHouse Server version: 22.3.15.33
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
        create table dbt.test_dbt_model__dbt_backup
        
  
    
    ON CLUSTER "test_v2" 
        
  engine = ReplicatedMergeTree()
        
      order by (tuple())
        
        
        SETTINGS  replicated_deduplication_window=0

          as (
            with source_data as (
    select 1 as id
    union all
    select null as id
)
select *
from source_data
          )
@Timev Timev added the bug Something isn't working label Dec 4, 2023
@genzgd
Copy link
Contributor

genzgd commented Dec 4, 2023

Thanks for including complete debugging information with this report!

Before ClickHouse v 23.7, dbt does a CREATE TABLE AS SELECT ... and inserts data at the same time as table creation. Since this gets executed ON CLUSTER, with ClickHouse deduplication off each CREATE TABLE statement inserts the same data. After ClickHouse v 23.7, dbt does a CREATE TABLE AS SELECT ... EMPTY and then executes the INSERT INTO statement directly, without an ON CLUSTER clause.

So it's pretty clear that allow_automatic_deduplication: True is required for older ClickHouse versions, and in the next release we'll make that the default if an older version is detected.

Unfortunately there was a subtle bug in the profile handling where allow_automatic_deduplication: True was not actually working, and this will hopefully be fixed in 1.6.1 as well.

@genzgd genzgd mentioned this issue Dec 5, 2023
2 tasks
@genzgd genzgd linked a pull request Dec 5, 2023 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants