Integrating Vector with ClickHouse
Being able to analyze your logs in real time is critical for production applications. ClickHouse excels at storing and analyzing log data thanks to its excellent compression (up to 170x for logs) and its ability to aggregate large amounts of data quickly.
This guide shows you how to use the popular data pipeline Vector to tail an Nginx log file and send it to ClickHouse. The steps below are similar for tailing any type of log file.
Prerequisites:
- You already have ClickHouse up and running
- You have Vector installed
Create a database and table
Define a table to store the log events:
- Begin with a new database named `nginxdb`:
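A minimal statement for this step:

```sql
CREATE DATABASE IF NOT EXISTS nginxdb
```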
- For now, insert the entire log event as a single string. This is not a great format for performing analytics on the log data, but we will sort that out below using materialized views:
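A minimal sketch of such a table; the single `message` column matches the field name Vector uses for the raw log line (verify this against your Vector output):

```sql
CREATE TABLE IF NOT EXISTS nginxdb.access_logs (
    message String
)
ENGINE = MergeTree()
ORDER BY tuple()
```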
`ORDER BY` is set to `tuple()` (an empty tuple) because there is no need for a primary key yet.
Configure Nginx
In this step, you will configure logging on your Nginx server.
- The following `access_log` property sends logs to `/var/log/nginx/my_access.log` in the combined format. This value goes in the `http` section of your `nginx.conf` file:
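For example (the surrounding `http` block is abbreviated here):

```nginx
http {
    # ...your existing settings...
    access_log  /var/log/nginx/my_access.log combined;
}
```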
- Be sure to restart Nginx if you had to modify `nginx.conf`.
- Generate some log events in the access log by visiting pages on your web server. Logs in the combined format look as follows:
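An illustrative line (the address, timestamp, and user agent are made up):

```
192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/7.68.0"
```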
Configure Vector
Vector collects, transforms, and routes logs, metrics, and traces from inputs (referred to as sources) to many different destinations (referred to as sinks), including out-of-the-box compatibility with ClickHouse. Sources and sinks are defined in a configuration file named `vector.toml`.
- The following `vector.toml` file defines a source of type `file` that tails the end of `my_access.log`, and it also defines a sink as the `access_logs` table defined above:
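A sketch of such a configuration; the endpoint assumes ClickHouse is listening on `localhost:8123` (adjust the endpoint and any credentials for your deployment):

```toml
[sources.nginx_logs]
type = "file"
include = ["/var/log/nginx/my_access.log"]
# Tail from the end of the file rather than re-reading old entries
read_from = "end"

[sinks.clickhouse]
type = "clickhouse"
inputs = ["nginx_logs"]
endpoint = "http://localhost:8123"
database = "nginxdb"
table = "access_logs"
# Vector adds fields (file, host, timestamp) with no matching columns
# in access_logs; this tells the sink to ignore them
skip_unknown_fields = true
```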
- Start Vector using the configuration above; visit the Vector documentation for more details on defining sources and sinks:
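Assuming the file above is saved as `vector.toml` in the current directory:

```bash
vector --config ./vector.toml
```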
- Verify that the access logs are being inserted into ClickHouse by running the following query. You should see the access logs in your table:
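```sql
SELECT * FROM nginxdb.access_logs
```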
Parse the Logs
Having the logs in ClickHouse is great, but storing each event as a single string does not allow for much data analysis. We'll next look at how to parse the log events using a materialized view.
A materialized view functions similarly to an insert trigger in SQL. When rows of data are inserted into a source table, the materialized view transforms those rows and inserts the results into a target table. We can use a materialized view to store a parsed representation of the log events in `access_logs`. An example of one such log event is shown below:
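Reusing the illustrative combined-format line from earlier:

```
192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/7.68.0"
```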
There are various functions in ClickHouse to parse the above string. The `splitByWhitespace` function splits a string on whitespace and returns each token in an array.
To demonstrate, run the following command:
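Using the illustrative log line from above:

```sql
SELECT splitByWhitespace('192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/7.68.0"')
```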
A few of the strings have some extra characters, and the user agent (the browser details) gets split apart when it should stay intact, but the resulting array is close to what is needed.
Similar to `splitByWhitespace`, the `splitByRegexp` function splits a string into an array based on a regular expression.
Run the following command, which returns two strings. Notice that the second string is the user agent, successfully parsed out of the log line:
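Again using the sample line; note the doubled backslashes, so that the `\S` and `\d` classes survive ClickHouse string-literal escaping:

```sql
SELECT splitByRegexp('\\S \\d+ "([^"]*)"', '192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/7.68.0"')
```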
Before looking at the final `CREATE MATERIALIZED VIEW` command, let's look at a couple more functions used to clean up the data.
For example, the value of `RequestMethod` is `"GET`, which contains an unwanted double quote. You can use the `trim` function to remove it:
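```sql
SELECT trim(LEADING '"' FROM '"GET')
```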
The time string has a leading square bracket, and is also not in a format that ClickHouse can parse into a date. However, if we change the separator between the date and the time from a colon (:) to a space, then the parsing works great:
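Combining `trim`, `replaceOne`, and `parseDateTimeBestEffort` on the timestamp token from the sample line:

```sql
SELECT parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM '[12/Oct/2021:03:31:44'), ':', ' '))
```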
We are now ready to define the materialized view.
The definition below includes `POPULATE`, which means the existing rows in `access_logs` will be processed and inserted right away.
Run the following SQL statement:
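A sketch of the view, combining the functions above. The column names and the `ORDER BY` key are reasonable choices rather than requirements, and the array indexes assume the combined format shown earlier:

```sql
CREATE MATERIALIZED VIEW nginxdb.access_logs_view
(
    RemoteAddr String,
    Client String,
    RemoteUser String,
    TimeLocal DateTime,
    RequestMethod String,
    Request String,
    HttpVersion String,
    Status Int32,
    BytesSent Int64,
    UserAgent String
)
ENGINE = MergeTree()
ORDER BY RemoteAddr
POPULATE AS
WITH
    splitByWhitespace(message) AS split,
    splitByRegexp('\\S \\d+ "([^"]*)"', message) AS referer
SELECT
    split[1] AS RemoteAddr,
    split[2] AS Client,
    split[3] AS RemoteUser,
    -- strip the leading '[' and swap the first ':' for a space so the
    -- timestamp becomes parseable
    parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM split[4]), ':', ' ')) AS TimeLocal,
    trim(LEADING '"' FROM split[6]) AS RequestMethod,
    split[7] AS Request,
    trim(TRAILING '"' FROM split[8]) AS HttpVersion,
    split[9] AS Status,
    split[10] AS BytesSent,
    trim(BOTH '"' FROM referer[2]) AS UserAgent
FROM nginxdb.access_logs
```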
Now verify it worked. You should see the access logs nicely parsed into columns:
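```sql
SELECT * FROM nginxdb.access_logs_view
```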
The lesson above stored the data in two tables, but you could change the initial `nginxdb.access_logs` table to use the `Null` table engine. The parsed data will still end up in the `nginxdb.access_logs_view` table, but the raw data will not be stored in a table:
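A sketch of that variant; with the `Null` engine, the raw rows are discarded once the materialized view has processed them:

```sql
CREATE TABLE IF NOT EXISTS nginxdb.access_logs (
    message String
)
ENGINE = Null
```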
Vector requires only a simple install and quick configuration, and it can send logs from an Nginx server to a table in ClickHouse. With a materialized view, you can parse those logs into columns for easier analytics.