System Architecture

Each tab is a top-down flow with live status. Click any box to see what happens inside, in order.

Search Query Flow

👤

Client

browser · mobile

HTTPS · search query

🅿️

PHP / Laravel

www.dumyah.com

/api/query

🔍

ML /query

intent + entity extraction

/embed

🧠

ML /embed

BGE-M3 fp16 · GPU

dense + sparse vectors

🔎

Elasticsearch

BM25 + KNN + sparse + RRF

top-K candidates

🏆

ML /rank

LTR LambdaRank

scored + reordered

🗄️

MySQL

product details enrichment

JSON response

✅

Client response

ranked product list
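
The fusion step in the Elasticsearch box (BM25 + KNN + sparse + RRF) can be illustrated with a minimal reciprocal rank fusion sketch. The production fusion runs inside Elasticsearch, so this is only an illustration of the scoring idea. The constant k=60 is a commonly used default and an assumption here, as are the function name and the sample product IDs.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    taken from this system's configuration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by doc ID for a stable order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


# Hypothetical result lists from the three retrievers.
bm25 = ["p1", "p2", "p3"]
knn = ["p2", "p4", "p1"]
sparse = ["p2", "p1", "p5"]
fused = rrf_fuse([bm25, knn, sparse])
```

A document that ranks well in several lists (p2 here) rises above one that tops only a single list, which is why RRF needs no score normalization across retrievers.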

Catalog Indexing Flow

✏️

PHP catalog update

create / edit product in Laravel

POST /reindex/delta · product_ids[]

🔄

Delta reindex worker

capped 500 IDs · ~30ms each

mget product source from ES

⚙️

ML embed (BGE-M3)

build_embed_text → encode

bulk update title_vector + sparse

💾

ES write

products index

searchable

🔎

Elasticsearch (shared)

shared with search flow
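
A caller that has more changed products than the delta worker accepts would need to split them up. The sketch below assumes the 500-ID cap applies per POST /reindex/delta request and that the body carries a product_ids array as shown in the flow; the helper name is hypothetical.

```python
def chunk_product_ids(product_ids, cap=500):
    """Deduplicate IDs (keeping first-seen order) and split them into
    batches of at most `cap` IDs each.

    The per-request cap of 500 is an assumption about how the
    "capped 500 IDs" limit is enforced.
    """
    unique_ids = list(dict.fromkeys(product_ids))
    return [unique_ids[i:i + cap] for i in range(0, len(unique_ids), cap)]


# 1,200 changed products become three request bodies.
batches = chunk_product_ids(range(1, 1201))
payloads = [{"product_ids": batch} for batch in batches]
```

If IDs are processed one after another at ~30 ms each, a full batch of 500 would take about 15 seconds.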

Search-event ingestion (parallel)

Every search/click is logged to ClickHouse for LTR training data + analytics.

PHP →

📊

ClickHouse

search_events · ltr_training_data

LTR Training Pipeline

🚦

Trigger

manual · cron · CLI

start_training()

📊

ClickHouse data fetch

ltr_training_data · last 90 days

X, y, qids, positions

🧮

Preprocess

log transform · label grades

cleaned arrays

✂️

3-way split (70/15/15)

train · val · test (time-based)

three folds

⚖️

Sample weights

position bias × GMV bonus

weighted folds

🤖

Train LightGBM LambdaRank

500 rounds · early-stop on val

best-iteration model

📐

Evaluate (test + val)

NDCG@5/@10 · MAP · gen-gap

metrics JSON

💾

Save model · vN+1

keep last 10 · audit log

attention_required

🏆

Active LTR model

operator clicks /activate
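
The split and weighting steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's actual code: the split is done per query so that no query straddles two folds, and the weight formula (a log2 position term times a log-scaled GMV bonus) is a hypothetical choice, since the source names only "position bias × GMV bonus". All function and parameter names are invented for the example.

```python
import math


def time_based_split(query_times, train_pct=70, val_pct=15):
    """Split query IDs into train/val/test by time, oldest first.

    query_times maps qid -> timestamp. Whole queries stay in one fold,
    and the newest queries end up in the test fold.
    """
    ordered = sorted(query_times, key=query_times.get)
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def sample_weight(position, gmv, gmv_scale=0.1):
    """Hypothetical weight: position-bias correction times a GMV bonus.

    Interactions at deeper positions are seen less often, so they are
    weighted up; higher-GMV interactions get a mild extra boost.
    """
    position_term = math.log2(1 + position)         # 1.0 at position 1
    gmv_bonus = 1.0 + gmv_scale * math.log1p(gmv)   # 1.0 when gmv == 0
    return position_term * gmv_bonus


# Twenty hypothetical queries with increasing timestamps.
query_times = {f"q{i}": i for i in range(20)}
train, val, test = time_based_split(query_times)
```

Splitting by time rather than at random keeps future searches out of the training fold, so validation and test scores better reflect how the model behaves on traffic it has not seen.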

Weekly retrain cron + frozen-set eval

Two scheduled jobs feed the pipeline above. Pause / run-now from Admin → Schedules.

📅

weekly-ltr-retrain

Mon 02:00 UTC · save-only, no auto-promote

🧊

daily-frozen-eval

03:00 UTC · scores every saved model on a stable slice
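
Both the Evaluate step and the daily frozen-set eval report NDCG. A minimal sketch of NDCG@k for graded labels follows; the exponential gain 2^rel − 1 is a common convention and an assumption here, and the function names are illustrative rather than taken from the codebase.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded labels."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """NDCG@k for one query; relevances are labels in model-ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(queries, k):
    """Average NDCG@k across queries (a list of label lists)."""
    return sum(ndcg_at_k(q, k) for q in queries) / len(queries)


perfect = ndcg_at_k([3, 2, 1, 0], k=5)          # ideal ordering
reversed_order = ndcg_at_k([0, 1, 2, 3], k=5)   # worst ordering of the same labels
```

Scoring every saved model on the same frozen slice makes these numbers comparable across versions, which a rolling 90-day window alone does not guarantee.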

BGE-M3 Fine-tune Pipeline

High-stakes, manual, ~quarterly. Each step is gated by human review — never auto-run end-to-end. Trigger steps from Admin → Fine-Tuning.

🎯

Decision to fine-tune

evidence-driven · ~quarterly

POST /api/admin/finetune/export

⛏️

1. Mine triplets from CH+ES

(query, positive, hard-negatives)

triplets_YYYYMMDD.jsonl

🔍

2. Validate triplets

count · language balance · dedup

human gate · proceed?

🧠

3. Fine-tune BGE-M3

train_bge_m3.sh · GPU · hours

/app/model-cache/finetune/<name>/

🧪

4. Smoke test (offline)

eval_finetuned.py · baseline vs new

human gate · ship?

🔀

5. Activate (hot-swap embedder)

POST /finetune/activate · embedder.load()

⚠️ ES vectors still come from the OLD model until the full reindex completes

🔁

6. Full reindex (~87K docs)

POST /reindex · 30–60 min on RTX 4000

vectors aligned with new model

✅

7. Verify post-deploy

live smoke + Grafana watch
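
The checks in step 2 (count, language balance, dedup) could look roughly like the sketch below, which produces the summary a reviewer would read at the human gate. The JSONL field names (query, positive, negatives), the minimum-count threshold, and the Arabic detection by Unicode range are all assumptions made for illustration.

```python
import json


def is_arabic(text):
    """True if the text contains any character in the basic Arabic block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def validate_triplets(jsonl_lines, min_count=1000):
    """Dedupe triplets and summarize count and language balance.

    jsonl_lines is a list of JSON strings, one triplet per line.
    Field names are hypothetical; adjust to the real export schema.
    """
    seen = set()
    unique = []
    for line in jsonl_lines:
        row = json.loads(line)
        key = (row["query"], row["positive"])
        if key in seen:
            continue  # drop exact (query, positive) duplicates
        seen.add(key)
        unique.append(row)
    total = len(unique)
    arabic = sum(1 for row in unique if is_arabic(row["query"]))
    return {
        "total": total,
        "duplicates_dropped": len(jsonl_lines) - total,
        "arabic_share": arabic / total if total else 0.0,
        "enough_data": total >= min_count,
    }


lines = [
    json.dumps({"query": "red lipstick", "positive": "p1", "negatives": ["p9"]}),
    json.dumps({"query": "red lipstick", "positive": "p1", "negatives": ["p8"]}),
    json.dumps({"query": "زيت الجسم", "positive": "p2", "negatives": ["p7"]}),
]
report = validate_triplets(lines, min_count=2)
```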

⚠️ Rollback procedure

If post-deploy verification surfaces issues, rollback is two steps:

1. POST /api/admin/finetune/activate { model_path: "<previous>" }: hot-swap back to the previous embedder
2. POST /reindex: rebuild ES vectors with the previous model (~30–60 min)

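Search quality is degraded between steps (1) and (2), because queries are embedded by the previous model while ES still holds vectors from the model being rolled back. Rollback is faster if the previous model directory is still on disk.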
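
The ordering of the two rollback calls matters, and a small wrapper makes that explicit. The sketch below assumes a caller-supplied post(path, body) function that returns True on success; that function, the empty body for /reindex, and the stop-on-failure behavior are hypothetical and not taken from the admin API.

```python
def rollback(post, previous_model_path):
    """Run the two rollback steps in order and stop if the hot-swap fails.

    `post(path, body)` is a hypothetical caller-supplied function that
    sends the request (with host and auth) and returns True on success.
    """
    steps = [
        ("/api/admin/finetune/activate", {"model_path": previous_model_path}),
        ("/reindex", {}),  # empty body is an assumption
    ]
    completed = []
    for path, body in steps:
        if not post(path, body):
            break  # do not reindex with an embedder that failed to load
        completed.append(path)
    return completed


# Dry run with a fake transport that records calls instead of sending them.
calls = []


def fake_post(path, body):
    calls.append((path, body))
    return True


done = rollback(fake_post, "<previous>")
```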

Visual Search Pipeline (SigLIP 2)

SigLIP 2, an Apache-2.0 multilingual vision-language model, adds a visual lane to retrieval. The image encoder runs at index time (~50 min one-shot, ~2 min/week delta); the text encoder runs at query time (~80 ms p50 on GPU). Both share a 1152-dim embedding space, so text queries KNN-match against product image vectors with no extra translation step. The feature is kill-switched off by default until rollout.

📥 Index-time (batch)

⚙️

PHP: index:product-images

--missing-only (default) or --all · cron weekly

scroll ES products → batches of 64 by image URL

📡

POST /image-embed/batch

ML service · async URL fetch + GPU forward pass

parallel httpx fetch · ~30 img/sec on RTX 4000

🖼

SigLIP 2 image encoder

google/siglip2-large-patch16-384 · ~2.5 GB VRAM · L2-normalized 1152-d

PHP bulk_update partial doc

🗄

ES products[image_vector]

dense_vector dims=1152 similarity=cosine

Layer-2 vector preservation (already in RollingIndexManager) also copies image_vector from old → new index before alias swaps when SEARCH_IMAGE_VECTORS_ENABLED=true — so the May-10-style erosion can't happen to images either.
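
For reference, the image_vector field and the partial-doc update described above might be expressed as follows. The mapping uses Elasticsearch's dense_vector field type with the dims and similarity stated in the flow. The bulk update in production is issued from PHP, so the Python helpers here are illustrative only, and their names are hypothetical.

```python
import json
import math

# Mapping fragment for the products index, matching dims=1152 similarity=cosine.
IMAGE_VECTOR_MAPPING = {
    "properties": {
        "image_vector": {
            "type": "dense_vector",
            "dims": 1152,
            "index": True,
            "similarity": "cosine",
        }
    }
}


def l2_normalize(vector):
    """Scale a vector to unit length, as the encoder output is described."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def bulk_update_lines(product_id, vector, index="products"):
    """Build the two NDJSON lines for one partial-doc update in the bulk API."""
    action = {"update": {"_index": index, "_id": str(product_id)}}
    partial = {"doc": {"image_vector": l2_normalize(vector)}}
    return [json.dumps(action), json.dumps(partial)]


# Toy 2-d vector for readability; real vectors have 1152 dimensions.
lines = bulk_update_lines(123, [3.0, 4.0])
```

A partial-doc update touches only image_vector, so the text vectors written by the catalog indexing flow are left as they are.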

🔍 Query-time (per request)

👤

User query (Arabic or English)

e.g. "زيت الجسم" / "red lipstick"

PHP HybridStrategyRouter (Phase 2)

📡

POST /image-embed/text

ML service · text → vector in image space

SigLIP 2 TEXT encoder · ~80 ms p50

📝

Shared embedding space

same 1152-d output as images · L2-normalized

ES KNN against image_vector field

🗄

Top-K product candidates by visual sim

fused with BM25 + text-vector via RRF

f_image_text_similarity as LTR feature

🏆

LightGBM LTR rerank (v18+)

visual sim joins 30+ other features

Phase 2: HybridStrategyRouter wires the image-vector KNN as a 4th retriever alongside BM25 + text-vector + sparse-vector. The text→image lookup adds ~80 ms p50 to query latency; it can be made optional per query via a flag.
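
The query-time lookup amounts to a kNN search on image_vector with the text encoder's output. The sketch below builds a request body using Elasticsearch's top-level knn search option and adds a brute-force stand-in that shows why cosine ranking on L2-normalized vectors reduces to a dot product. The k and num_candidates values, the helper names, and the toy vectors are assumptions for illustration.

```python
def knn_search_body(query_vector, k=50, num_candidates=500):
    """Request body for an approximate kNN search on image_vector.

    k and num_candidates are illustrative defaults, not tuned values.
    """
    return {
        "knn": {
            "field": "image_vector",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": False,
    }


def top_k_by_dot(query_vector, image_vectors, k=2):
    """Brute-force stand-in for kNN: with unit vectors, the dot product
    equals cosine similarity, so sorting by it gives the same ranking."""
    scored = [
        (sum(q * v for q, v in zip(query_vector, vec)), product_id)
        for product_id, vec in image_vectors.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [product_id for _, product_id in scored[:k]]


# Toy unit vectors in 2-d; real vectors have 1152 dimensions.
query = [1.0, 0.0]
images = {"p1": [0.6, 0.8], "p2": [1.0, 0.0], "p3": [0.0, 1.0]}
nearest = top_k_by_dot(query, images)
body = knn_search_body(query, k=2, num_candidates=10)
```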

Why SigLIP 2 (not jina-clip-v2)?

License: Apache 2.0 (free for commercial use) vs jina-clip-v2's CC-BY-NC-4.0 (commercial use requires a paid license, ~$10k–50k/yr at our scale).
No trust_remote_code: standard SiglipModel in transformers, no custom Python downloaded from the HF Hub at load time.
Smaller: ~2.5 GB VRAM vs ~3 GB for jina-clip-v2.
Multilingual coverage: 100+ languages incl. Arabic vs 89 for jina-clip-v2.
Provenance: Google research, well-documented training data and benchmarks.