Connect dlt to ClickHouse
dlt is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.
Install dlt with ClickHouse
To Install the dlt library with ClickHouse dependencies:
Setup guide
Initialize the dlt Project
Start by initializing a new dlt project as follows:
This command will initialize your pipeline with chess as the source and ClickHouse as the destination.
The above command generates several files and directories, including .dlt/secrets.toml and a requirements file for ClickHouse. You can install the necessary dependencies specified in the requirements file by executing it as follows:
or with pip install dlt[clickhouse], which installs the dlt library and the necessary dependencies for working with ClickHouse as a destination.
Setup ClickHouse Database
To load data into ClickHouse, you need to create a ClickHouse database. Here's a rough outline of what should you do:
- 
You can use an existing ClickHouse database or create a new one. 
- 
To create a new database, connect to your ClickHouse server using the clickhouse-clientcommand line tool or a SQL client of your choice.
- 
Run the following SQL commands to create a new database, user and grant the necessary permissions: 
Add credentials
Next, set up the ClickHouse credentials in the .dlt/secrets.toml file as shown below:
The http_port parameter specifies the port number to use when connecting to the ClickHouse server's HTTP interface. This is different from default port 9000, which is used for the native TCP protocol.
You must set http_port if you are not using external staging (i.e. you don't set the staging parameter in your pipeline). This is because the built-in ClickHouse local storage staging uses the clickhouse content library, which communicates with ClickHouse over HTTP.
Make sure your ClickHouse server is configured to accept HTTP connections on the port specified by http_port. For example, if you set http_port = 8443, then ClickHouse should be listening for HTTP requests on port 8443. If you are using external staging, you can omit the http_port parameter, since clickhouse-connect will not be used in this case.
You can pass a database connection string similar to the one used by the clickhouse-driver library. The credentials above will look like this:
Write disposition
All write dispositions are supported.
Write dispositions in the dlt library define how the data should be written to the destination. There are three types of write dispositions:
Replace: This disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data. You can learn more about it here.
Merge: This write disposition merges the data from the resource with the data at the destination. For merge disposition, you would need to specify a primary_key for the resource. You can learn more about it here.
Append: This is the default disposition. It will append the data to the existing data in the destination, ignoring the primary_key field.
Data loading
Data is loaded into ClickHouse using the most efficient method depending on the data source:
- For local files, the clickhouse-connectlibrary is used to directly load files into ClickHouse tables using theINSERTcommand.
- For files in remote storage like S3,Google Cloud Storage, orAzure Blob Storage, ClickHouse table functions like s3, gcs and azureBlobStorage are used to read the files and insert the data into tables.
Datasets
Clickhouse does not support multiple datasets in one database, whereas dlt relies on datasets due to multiple reasons. In order to make Clickhouse work with dlt, tables generated by dlt in your Clickhouse database will have their names prefixed with the dataset name, separated by the configurable dataset_table_separator. Additionally, a special sentinel table that does not contain any data will be created, allowing dlt to recognize which virtual datasets already exist in a Clickhouse destination.
Supported file formats
- jsonl is the preferred format for both direct loading and staging.
- parquet is supported for both direct loading and staging.
The clickhouse destination has a few specific deviations from the default sql destinations:
- Clickhousehas an experimental- objectdatatype, but we have found it to be a bit unpredictable, so the dlt clickhouse destination will load the complex datatype to a text column. If you need this feature, get in touch with our Slack community, and we will consider adding it.
- Clickhousedoes not support the- timedatatype. Time will be loaded to a- textcolumn.
- Clickhousedoes not support the- binarydatatype. Instead, binary data will be loaded into a- textcolumn. When loading from- jsonl, the binary data will be a base64 string, and when loading from parquet, the- binaryobject will be converted to- text.
- Clickhouseaccepts adding columns to a populated table that are not null.
- Clickhousecan produce rounding errors under certain conditions when using the float or double datatype. If you cannot afford to have rounding errors, make sure to use the decimal datatype. For example, loading the value 12.7001 into a double column with the loader file format set to- jsonlwill predictably produce a rounding error.
Supported column hints
ClickHouse supports the following column hints:
- primary_key- marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
Table engine
By default, tables are created using the ReplicatedMergeTree table engine in ClickHouse. You can specify an alternate table engine using the table_engine_type with the clickhouse adapter:
Supported values are:
- merge_tree- creates tables using the- MergeTreeengine
- replicated_merge_tree(default) - creates tables using the- ReplicatedMergeTreeengine
Staging support
ClickHouse supports Amazon S3, Google Cloud Storage and Azure Blob Storage as file staging destinations.
dlt will upload Parquet or jsonl files to the staging location and use ClickHouse table functions to load the data directly from the staged files.
Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations:
To run a pipeline with staging enabled:
Using Google Cloud Storage as a staging area
dlt supports using Google Cloud Storage (GCS) as a staging area when loading data into ClickHouse. This is handled automatically by ClickHouse's GCS table function which dlt uses under the hood.
The clickhouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys. To enable this, GCS provides an S3 compatibility mode that emulates the Amazon S3 API. ClickHouse takes advantage of this to allow accessing GCS buckets via its S3 integration.
To set up GCS staging with HMAC authentication in dlt:
- 
Create HMAC keys for your GCS service account by following the Google Cloud guide. 
- 
Configure the HMAC keys as well as the client_email,project_idandprivate_keyfor your service account in your dlt project's ClickHouse destination settings inconfig.toml:
Note: In addition to the HMAC keys bashgcp_access_key_id and gcp_secret_access_key), you now need to provide the client_email, project_id and private_key for your service account under [destination.filesystem.credentials]. This is because the GCS staging support is now implemented as a temporary workaround and is still unoptimized.
dlt will pass these credentials to ClickHouse which will handle the authentication and GCS access.
There is active work in progress to simplify and improve the GCS staging setup for the ClickHouse dlt destination in the future. Proper GCS staging support is being tracked in these GitHub issues:
- Make filesystem destination work with gcs in s3 compatibility mode
- Google Cloud Storage staging area support
Dbt support
Integration with dbt is generally supported via dbt-clickhouse.
Syncing of dlt state
This destination fully supports dlt state sync.
