Elasticsearch一图流
Lucene 核心数据结构 (单机视角) Lucene 是如何通过倒排索引实现快速搜索,以及如何解决排序和存储问题的 graph TD subgraph Lucene_Segment ["Lucene Segment (最小索引单元/不可变)"] direction TB subgraph Inverted_Index ["倒排索引 (Inverted Index) - 用于搜索"] TI["Term Index (内存)"] TD["Term Dictionary (磁盘)"] PL["Posting List (磁盘)"] TI --"前缀索引/加速定位"--> TD TD --"关联文档ID"--> PL note1["> 举例: 查找 '小白'<br/>1. Term Index 定位大概位置<br/>2. Dictionary 找到 '小白'<br/>3. Posting List 得到 ID: [0, 1]"] style TI fill:#e1f5fe,stroke:#01579b style TD fill:#fff9c4,stroke:#fbc02d style PL fill:#fff9c4,stroke:#fbc02d end subgraph Storage ["数据存储"] SF["Stored Fields (行式存储)"] DV["Doc Values (列式存储)"] SF --"根据ID获取完整内容"--> Content["文档原始JSON"] DV --"用于排序/聚合"--> Sort["排序/聚合操作"] note2["> Stored Fields: 存完整数据<br/>> Doc Values: 空间换时间,优化排序"] style SF fill:#e8f5e9,stroke:#2e7d32 style DV fill:#f3e5f5,stroke:#7b1fa2 end end Query["搜索请求"] --> TI PL -.->|"得到文档ID"| Storage Elasticsearch 在Lucene基础上的分布式架构 架构图 graph Client["客户端应用"] --> Coord subgraph Cluster["Elasticsearch Cluster (集群)"] subgraph CoordLayer["协调节点层 (请求入口/分发/聚合)"] Coord["协调节点 (Coordinating Node)"] end subgraph MasterLayer["Master 节点层 (集群管理/选主/元数据)"] Master1["Master Node A"] Master2["Master Node B"] Master3["Master Node C"] end subgraph DataLayer["Data 节点层 (存储/搜索/写入)"] Data1["Data Node 1"] Data2["Data Node 2"] Data3["Data Node 3"] end subgraph IndexLayer["Index: news_index (逻辑索引)"] subgraph Shard0["Shard 0"] P0["Primary Shard 0"] R0["Replica Shard 0"] end subgraph Shard1["Shard 1"] P1["Primary Shard 1"] R1["Replica Shard 1"] end end subgraph LuceneLayer["Lucene 层 (每个分片内的底层引擎)"] subgraph SegA["Segment A (不可变)"] IIA["倒排索引"] SFA["Stored Fields"] DVA["Doc Values"] end subgraph SegB["Segment B (不可变)"] IIB["倒排索引"] SFB["Stored Fields"] DVB["Doc Values"] end end end %% === 核心链路简化 === %% 1. 请求路由: Coord -> Data Node (物理流向: 先找到持有分片的节点) Coord --"路由写入 / 分发查询"--> Data1 %% 2. 数据处理: Data Node -> Shard (节点调用内部主要分片) Data3 -- "执行请求" --> P0 %% 3. 数据同步: Primary -> Replica (跨节点同步) P1 -.-> R1 P0 -.-> R0 %% 4. 其它关系示意 Master3 -.-|"集群管理"| Data3 R0 -.-|"底层引擎"| SegA %% === 样式定义 === style Cluster fill:#f5f5f5,stroke:#333,stroke-width:2px style CoordLayer fill:#e3f2fd,stroke:#1565c0,stroke-dasharray: 5 5 style MasterLayer fill:#f3e5f5,stroke:#7b1fa2,stroke-dasharray: 5 5 style DataLayer fill:#e8f5e9,stroke:#2e7d32,stroke-dasharray: 5 5 style IndexLayer fill:#fff9c4,stroke:#fbc02d,stroke-dasharray: 5 5 style LuceneLayer fill:#fff3e0,stroke:#e65100 style SegA fill:#ffffff,stroke:#333 style SegB fill:#ffffff,stroke:#333 classDiagram class Cluster { +String cluster_name +List nodes } class Node { +String node_name +List roles +store_data() +coordinate_request() } class NodeRoles { Master Data Coordinating } note for NodeRoles "枚举类型 (Enumeration)<br/>Master: 集群管理<br/>Data: 数据存储<br/>Coordinating: 请求转发" class Index { +String index_name +List shards } class Shard { +int shard_id +LuceneInstance underlying_engine } class PrimaryShard { +sync_to_replica() } class ReplicaShard { +promote_to_primary() } class LuceneInstance { +List segments +merge_segments() } class Segment { +InvertedIndex +StoredFields +DocValues +is_immutable } Cluster *-- Node Node ..> NodeRoles : 扮演角色 Index *-- Shard Shard <|-- PrimaryShard Shard <|-- ReplicaShard Shard *-- LuceneInstance : 底层引擎 LuceneInstance "1" *-- "*" Segment : 包含多个 Node "1" o-- "*" Shard : 承载 note for PrimaryShard "负责写入,同步数据给副本" note for ReplicaShard "提供读能力,主分片挂掉后升级" note for Node "单节点可扩展为多节点集群<br/>通过 Raft 类机制选主" note for LuceneInstance "写入产生新Segment<br/>后台定期合并小Segment<br/>以优化文件句柄和查询性能" 从 Lucene 到 Elasticsearch 的演进 Lucene 作为一个单纯的搜索库,虽然功能强大,但在面对海量数据和高并发场景时存在明显的单机局限性: ...