Announcement

👇Official Account👇

图片

Welcome to join the group & private message

Article first/tail QR code

Skip to content

ClickHouse 实战:从入门到高性能 OLAP 查询

ClickHouse 是俄罗斯 Yandex 开源的列式数据库,专为 OLAP 场景设计,查询性能比传统数据库快 100 倍以上。

一、ClickHouse 基础

1.1 什么是 ClickHouse

ClickHouse 是一个用于联机分析处理(OLAP)的列式数据库管理系统(DBMS)。

核心特点:

  • 列式存储
  • 向量化查询执行
  • 数据压缩
  • 并行处理
  • 实时数据插入

1.2 适用场景

场景说明示例
日志分析海量日志存储和查询Nginx 日志、应用日志
时序数据时间序列数据存储监控指标、传感器数据
数据分析大数据分析用户行为分析、BI 报表
实时报表实时数据看板实时 UV/PV 统计

二、核心概念

2.1 列式存储 vs 行式存储

行式存储(MySQL):
┌────┬──────┬─────┬────────┐
│ ID │ Name │ Age │ City   │
├────┼──────┼─────┼────────┤
│ 1  │ Alice│ 25  │ Beijing│  <- 整行读取
│ 2  │ Bob  │ 30  │ Shanghai│
└────┴──────┴─────┴────────┘

列式存储(ClickHouse):
ID:    [1, 2, 3, 4, 5]      <- 只读取需要的列
Name:  [Alice, Bob, Charlie, ...]
Age:   [25, 30, 35, ...]
City:  [Beijing, Shanghai, ...]

优势:

  • 查询只需读取相关列
  • 更好的数据压缩
  • 向量化查询执行

2.2 表引擎

sql
-- MergeTree 引擎(最常用)
CREATE TABLE events (
    event_date Date,
    user_id UInt64,
    event_type String,
    value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- ReplacingMergeTree(去重)
CREATE TABLE user_states (
    user_id UInt64,
    state String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- SummingMergeTree(聚合)
CREATE TABLE orders_summary (
    order_date Date,
    category String,
    amount Float64,
    count UInt64
) ENGINE = SummingMergeTree()
ORDER BY (order_date, category);

三、Go 操作 ClickHouse

3.1 连接配置

go
import "github.com/ClickHouse/clickhouse-go/v2"

func connect() (driver.Conn, error) {
    conn, err := clickhouse.Open(&clickhouse.Options{
        Addr: []string{"127.0.0.1:9000"},
        Auth: clickhouse.Auth{
            Database: "default",
            Username: "default",
            Password: "",
        },
        Settings: clickhouse.Settings{
            "max_execution_time": 60,
        },
    })
    
    return conn, err
}

3.2 数据写入

go
func batchInsert(conn driver.Conn, events []Event) error {
    batch, err := conn.PrepareBatch(context.Background(), "INSERT INTO events")
    if err != nil {
        return err
    }
    
    for _, event := range events {
        err := batch.Append(
            event.EventDate,
            event.UserID,
            event.EventType,
            event.Value,
        )
        if err != nil {
            return err
        }
    }
    
    return batch.Send()
}

3.3 数据查询

go
func queryEvents(conn driver.Conn, startDate, endDate time.Time) ([]Event, error) {
    rows, err := conn.Query(context.Background(), `
        SELECT event_date, user_id, event_type, value
        FROM events
        WHERE event_date BETWEEN ? AND ?
        ORDER BY event_date DESC
        LIMIT 1000
    `, startDate, endDate)
    
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    
    var events []Event
    for rows.Next() {
        var event Event
        if err := rows.Scan(
            &event.EventDate,
            &event.UserID,
            &event.EventType,
            &event.Value,
        ); err != nil {
            return nil, err
        }
        events = append(events, event)
    }
    
    return events, nil
}

四、性能优化

4.1 分区策略

sql
-- 按时间分区(推荐)
PARTITION BY toYYYYMM(event_date)

-- 按日期分区
PARTITION BY toDate(event_time)

-- 按类别分区
PARTITION BY category

4.2 索引优化

sql
-- 主键索引
ORDER BY (event_date, user_id)

-- 跳数索引
CREATE TABLE events (
    event_date Date,
    user_id UInt64,
    event_type String,
    value Float64,
    INDEX idx_value value TYPE minmax GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

4.3 数据类型选择

类型说明示例
UInt8/16/32/64无符号整数ID、计数
Int8/16/32/64有符号整数温度、差值
Float32/64浮点数金额、指标
String字符串名称、描述
Date/DateTime日期时间时间戳
Enum8/16枚举状态、类型

五、实战案例

5.1 日志分析系统

sql
-- 创建日志表
CREATE TABLE nginx_logs (
    timestamp DateTime,
    client_ip IPv4,
    request_method String,
    request_uri String,
    status_code UInt16,
    response_size UInt64,
    response_time Float64,
    user_agent String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, client_ip);

-- 查询慢请求
SELECT 
    request_uri,
    count() as request_count,
    avg(response_time) as avg_time,
    max(response_time) as max_time
FROM nginx_logs
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY request_uri
HAVING avg_time > 1.0
ORDER BY avg_time DESC
LIMIT 100;

5.2 用户行为分析

sql
-- 创建用户行为表
CREATE TABLE user_events (
    event_time DateTime,
    user_id UInt64,
    event_type String,
    page_url String,
    session_id String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (event_time, user_id);

-- 实时 UV 统计
SELECT 
    toStartOfHour(event_time) as hour,
    uniqExact(user_id) as uv,
    count() as pv
FROM user_events
WHERE event_time >= today()
GROUP BY hour
ORDER BY hour;

六、总结

优化点策略
分区按时间或类别分区
索引合理设计主键和跳数索引
类型选择合适的数据类型
写入批量写入
查询避免 SELECT *

ClickHouse 是 OLAP 场景的最佳选择,配合 Go 语言能构建高性能数据分析系统。

上次更新于: