Introduction to Lucene

Ingestion Process

  • Document Creation: The user creates a Document object in memory. Data model: a map-like structure of Field objects (e.g., TextField for searchable text, StoredField for retrievable data), held in RAM as Java objects.
  • Analysis (Tokenization & Filtering): The text is tokenized (split into words) and normalized (lowercased, stopwords removed, words stemmed, e.g., "running" → "run").
  • Term Addition to Index: The resulting terms are added to an in-memory index buffer.
  • Segment Flushing: When the buffer fills, data is flushed to disk as a new immutable segment.
  • Commit & Merging: On commit, segments are merged (in the background) into larger ones for efficiency (a code sketch of the whole flow follows this list).
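
A minimal sketch of this flow in Java, assuming the Lucene 9.x API (the index path and field names are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class IndexSketch {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer performs the analysis step: tokenize, lowercase, drop stopwords
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/idx")), config)) {
            Document doc = new Document();
            // TextField: analyzed and added to the inverted index; Store.YES also keeps the raw value
            doc.add(new TextField("title", "Hello World", Field.Store.YES));
            // StoredField: retrievable with the document, but not searchable
            doc.add(new StoredField("body", "Lucene stores documents efficiently"));
            writer.addDocument(doc);   // terms go into the in-memory buffer
            writer.commit();           // flushes a segment and makes it durable and visible
        }
    }
}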

Index Data model

Listing the distinct file extensions in an index directory:

ls -1 | cut -f2 -d. | sort | uniq
doc
dvd
dvm
fdm
fdt
fdx
fnm
lock
nvd
nvm
pos
segments_4
si
tim
tip
tmd

The most important files are .tim, .tip, .doc, and .pos. The full model is the following:

  • Vocabulary:

    • .tim: Term Dictionary - all unique terms (words)
    • .tip: Term Index - pointers/index into .tim
    • .doc: Postings - doc IDs and term frequencies
    • .pos: Postings - term positions
  • Stored Fields (Original Document Storage)

    • .fdt: Field Data - Actual stored field values (like a database)
    • .fdx: Field Index - Pointers to data in .fdt
    • .fdm: Field Metadata - compression and layout info for the stored fields files (.fdt/.fdx)
  • Doc Values (Column-oriented Storage)

    • .dvd: Doc Values Data - For sorting/faceting
    • .dvm: Doc Values Metadata
  • Norms (Field Length Normalization)

    • .nvd: Norms Data - Field length info for scoring
    • .nvm: Norms Metadata
  • Metadata Files

    • .fnm: Field Infos - maps field numbers to names, plus per-field flags (indexed, norms, doc values, etc.).
    • .si: Segment Info - segment metadata (doc count, codec, version, deleted docs, etc.).
    • .tmd: Term Metadata - extra metadata for the terms dictionary (field-level summaries and stats for .tim and .tip).
    • segments_N (segments_4 here): master file for the commit point; lists all segments, their versions, and commit metadata.
    • write.lock: write lock (prevents concurrent writers).

Example

  • The documents
# Documents (docID, title, body)
0, "Hello World", "Lucene stores documents efficiently"
1, "Apache Lucene", "Lucene uses segments to store data"
2, "Search Engines", "Elasticsearch is built on Lucene"
  • Metadata files
# .fnm
0: title (indexed=true, stored=true, hasTermVectors=false)
1: body (indexed=true, stored=false, hasNorms=true)

# .tmd: Term Metadata, stores extra metadata about terms (field-level summaries, term stats, checksums).
Field "title": 3 unique terms
Field "body": 6 unique terms
checksum: 0xA32F9C

# .si: Segment Info, describes the whole segment.
Segment name: _2
Lucene version: 9.0
Doc count: 3
Deleted docs: 0
Files: [_2.fdt, _2.fdx, _2.tim, _2.tip, ...]

# segments_4: Commit point, global file listing all segments that make up the index.
Segments:
  _2 (3 docs)
  _3 (7 docs)
  _4 (2 docs)
Generation: 4

# write.lock
hostname=localhost
processId=12345
  • Stored fields
# .fdt: Documents and their stored fields
Doc 0:
  title = "Hello World"
Doc 1:
  title = "Apache Lucene"
Doc 2:
  title = "Search Engines"

# .fdx: offsets for each doc, so Lucene can seek inside .fdt
Doc 0 offset: 0
Doc 1 offset: 34
Doc 2 offset: 71

# .fdm: metadata about how fields are stored and indexed
Field "title":
  type: text
  analyzer: standard
  norms: no
Field "body":
  type: text
  analyzer: standard
  norms: yes
  • Dictionary files
# .tim: Term dictionary for indexed fields
Term Dictionary:
  body: [
    "built" -> docFreq=1, totalTermFreq=1
    "data" -> docFreq=1, totalTermFreq=1
    "elasticsearch" -> docFreq=1, totalTermFreq=1
    "lucene" -> docFreq=2, totalTermFreq=2
    "segments" -> docFreq=1, totalTermFreq=1
    "stores" -> docFreq=1, totalTermFreq=1
  ]
  title: [
    "apache" -> docFreq=1
    "hello" -> docFreq=1
    "search" -> docFreq=1
  ]

# .tip: Pointers for terms in .tim file (for fast seek)
Pointers:
  "apache" → offset 0
  "lucene" → offset 128
  "search" → offset 192

# .doc: Postings (docIDs), lists which documents contain each term. 
Term: "lucene"
  → docIDs = [1, 2]
Term: "search"
  → docIDs = [2]
Term: "hello"
  → docIDs = [0]

# .pos: Positions, word positions within documents (for phrase queries, proximity).
Term: "lucene"
  Doc 1: positions [0]
  Doc 2: positions [4]
  • Doc values (Columnar values)
# .dvd columnar storage for sorting, faceting, analytics.
Field "popularity" (numeric doc values)
Doc 0: 10
Doc 1: 25
Doc 2: 5

# .dvm: contains metadata (like offsets, encodings).
Field count: 2
Field 0: popularity (numeric)
  offset: 0x00000010
  encoding: delta-compressed int
Field 1: category (sorted)
  offset: 0x00000100
  encoding: terms dictionary

  • Norms
# .nvd per-field normalization factors (used in scoring).
Field: body
Doc 0: norm=0.577
Doc 1: norm=0.707
Doc 2: norm=0.5

# .nvm: norms metadata.
Field count: 1
Field 0: body (norms)
  offset: 0x00000000
  encoding: byte
  numDocs: 3

Field Settings

Each field in a Lucene document has the following separate boolean settings (a sketch mapping them to field classes follows this list):

  • indexed: The field is searchable (terms go into the inverted index).
  • stored: The field’s original value is saved so it can be retrieved with the document.
  • docValues: The field’s value is stored in columnar form for sorting, faceting, etc.
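
A minimal sketch of how these settings map to Lucene's field classes; the field names and values are illustrative, reusing the imports from the indexing sketch above plus org.apache.lucene.document.NumericDocValuesField:

Document doc = new Document();
// indexed + stored: searchable and retrievable
doc.add(new TextField("title", "Hello World", Field.Store.YES));
// indexed only: searchable, but not returned with the document
doc.add(new TextField("body", "Lucene uses segments to store data", Field.Store.NO));
// stored only: retrievable, but never searchable
doc.add(new StoredField("rawSource", "{\"title\": \"Hello World\"}"));
// docValues only: columnar value for sorting, faceting, aggregations
doc.add(new NumericDocValuesField("popularity", 10));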

Norms

Norms are small numeric factors Lucene computes per field, per document to help with relevance scoring.

They typically encode things like:

  • How long the field is (shorter fields often get a boost),
  • Whether it contains many terms,
  • Field-level boosts applied at indexing time.

These are used when computing the TF-IDF or BM25 score that determines how relevant a document is to a query.
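
For reference, field length is where the norms enter BM25 (Lucene's default similarity since version 6.0); the per-term contribution is

$$\mathrm{score}(t,d) = \mathrm{IDF}(t)\cdot\frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\cdot\frac{|d|}{\mathrm{avgdl}}\right)}$$

where f(t,d) is the term frequency, |d| is the field length that the norms files encode, avgdl is the average field length, and k1 and b are tunable constants (1.2 and 0.75 by default).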

Doc values

Doc values are Lucene’s columnar data store — think of them like a per-field database column.

They’re designed for (a search-time sketch follows this list):

  • Sorting: e.g., sort search results by “price” or “date”
  • Faceting: e.g., count how many documents per “category”
  • Analytics: e.g., compute averages, histograms, or aggregations
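
A minimal sketch of sorting by a doc values field at search time, assuming a "popularity" field was indexed as a NumericDocValuesField as above (this fragment reuses the index directory and imports from the first sketch):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/idx")))) {
    IndexSearcher searcher = new IndexSearcher(reader);
    // sort descending by the "popularity" column; stored fields are never touched
    Sort byPopularity = new Sort(new SortField("popularity", SortField.Type.LONG, true));
    TopDocs top = searcher.search(new MatchAllDocsQuery(), 10, byPopularity);
}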

Index operations

Deletions

Deletes are soft: each segment keeps a "live docs" bitset with one bit per document, and a document's bit is set to 0 to mark it as deleted. Deleted documents are skipped at search time but still occupy disk space until a merge.

On segment merge, segments with a higher proportion of deleted docs are prioritized, since merging them reclaims the most space.
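
Deletes are expressed as term (or query) matches rather than raw doc IDs. A fragment reusing the IndexWriter from the first sketch, and assuming an "id" field that holds a unique key:

import org.apache.lucene.index.Term;

// marks every document whose "id" term equals "42" as deleted (its live-docs bit is cleared)
writer.deleteDocuments(new Term("id", "42"));
writer.commit();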

Updates

Updating a previously indexed document is a (soft, cheap) delete followed by a re-insertion of the whole document; there is no in-place update of values. The update is therefore even more expensive than adding the document in the first place, so storing rapidly changing values in a Lucene index is probably not a good idea.
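
Lucene exposes this delete-then-reinsert pattern directly as IndexWriter.updateDocument. A fragment, again assuming a unique "id" field (StringField, from org.apache.lucene.document, keeps the key as a single unanalyzed term):

Document newDoc = new Document();
newDoc.add(new StringField("id", "42", Field.Store.YES));
newDoc.add(new TextField("title", "Updated title", Field.Store.YES));
// atomically deletes any document matching the term, then adds the new one
writer.updateDocument(new Term("id", "42"), newDoc);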
