System Architecture

Each flow below runs top-down.

Search Query Flow

  Client
    browser · mobile
    ↓ HTTPS · search query
  PHP / Laravel
    www.dumyah.com
    ↓ /api/query
  ML /query
    intent + entity extraction
    ↓ /embed
  ML /embed
    BGE-M3 fp16 · GPU
    ↓ dense + sparse vectors
  Elasticsearch
    BM25 + KNN + sparse + RRF
    ↓ top-K candidates
  ML /rank
    LTR LambdaRank
    ↓ scored + reordered
  MySQL
    product details enrichment
    ↓ JSON response
  Client response
    ranked product list
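
The Elasticsearch step fuses its BM25, KNN, and sparse result lists with RRF (reciprocal rank fusion). As a rough illustration, the standard RRF formula scores each document as the sum of 1 / (k + rank) over the lists it appears in. The sketch below is an assumption-laden toy: the function name rrf_fuse, the product IDs, and k = 60 (a commonly used default) are all hypothetical, and the deployed fusion runs inside Elasticsearch with parameters not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse several ranked ID lists with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a stable order.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_k]]


# Hypothetical candidate lists from three retrievers.
bm25 = ["p1", "p2", "p3"]
knn = ["p2", "p1", "p4"]
sparse = ["p2", "p5", "p1"]
fused = rrf_fuse([bm25, knn, sparse])
print(fused)
```

A document ranked well by several retrievers (p2 here) rises above one that a single retriever ranked first, which is the property that makes RRF useful when the retrievers' raw scores are not comparable.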

Catalog Indexing Flow

  PHP catalog update
    create / edit product in Laravel
    ↓ POST /reindex/delta · product_ids[]
  Delta reindex worker
    capped 500 IDs · ~30ms each
    ↓ mget product source from ES
  ML embed (BGE-M3)
    build_embed_text → encode
    ↓ bulk update title_vector + sparse
  ES write
    products index
    ↓ searchable
  Elasticsearch (shared)
    shared with search flow
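
The delta reindex worker is capped at 500 IDs per request at roughly 30 ms per document. A minimal sketch of how a caller might prepare a request under that cap follows. The function name plan_delta_reindex, the deduplication step, and the "deferred" list are assumptions for illustration; how the real worker treats IDs beyond the cap is not stated here.

```python
MAX_DELTA_IDS = 500   # cap stated in the flow
EST_MS_PER_DOC = 30   # rough per-document cost stated in the flow


def plan_delta_reindex(product_ids, cap=MAX_DELTA_IDS):
    """Deduplicate IDs (keeping first-seen order) and split at the cap."""
    unique_ids = list(dict.fromkeys(product_ids))
    accepted, deferred = unique_ids[:cap], unique_ids[cap:]
    est_seconds = len(accepted) * EST_MS_PER_DOC / 1000
    return {
        "product_ids": accepted,   # body for POST /reindex/delta
        "deferred": deferred,      # hypothetical: left for a later request
        "est_seconds": est_seconds,
    }


# Hypothetical input: one duplicate plus 600 more IDs.
plan = plan_delta_reindex([101, 102, 101] + list(range(1000, 1600)))
print(len(plan["product_ids"]), len(plan["deferred"]), plan["est_seconds"])
```

At the stated figures, a full 500-ID request works out to about 15 seconds of embedding and write time.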

Search-event ingestion (parallel)

Every search/click is logged to ClickHouse for LTR training data + analytics.

  PHP → ClickHouse
    search_events · ltr_training_data

LTR Training Pipeline

  Trigger
    manual · cron · CLI
    ↓ start_training()
  ClickHouse data fetch
    ltr_training_data · last 90 days
    ↓ X, y, qids, positions
  Preprocess
    log transform · label grades
    ↓ cleaned arrays
  3-way split (70/15/15)
    train · val · test (time-based)
    ↓ three folds
  Sample weights
    position bias × GMV bonus
    ↓ weighted folds
  Train LightGBM LambdaRank
    500 rounds · early-stop on val
    ↓ best-iteration model
  Evaluate (test + val)
    NDCG@5/@10 · MAP · gen-gap
    ↓ metrics JSON
  Save model · vN+1
    keep last 10 · audit log
    ↓ attention_required
  Active LTR model
    operator clicks /activate
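
Two of the steps above, the time-based 70/15/15 split and the sample weights (position bias times GMV bonus), can be sketched in a few lines. This is a hedged illustration only: the function names, the choice to keep whole queries in one fold, and both weight formulas (log2(1 + position) and 1 + 0.1 * log1p(gmv)) are assumptions, not the pipeline's actual definitions.

```python
import math


def time_based_split(query_times, train_pct=70, val_pct=15):
    """Split query IDs into train/val/test by time, oldest first.

    query_times: dict mapping qid -> event timestamp.
    Whole queries go to one fold so a query group is never split
    (an assumption; the real pipeline may cut differently).
    """
    ordered = sorted(query_times, key=lambda qid: (query_times[qid], qid))
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def sample_weight(position, gmv, gmv_scale=0.1):
    """Hypothetical weight: position-bias term times GMV bonus.

    position: 1-based rank at which the product was shown.
    gmv: gross merchandise value attributed to the event (0 if none).
    """
    position_bias = math.log2(1 + position)        # deeper clicks count more
    gmv_bonus = 1.0 + gmv_scale * math.log1p(gmv)  # revenue nudges weight up
    return position_bias * gmv_bonus


# Hypothetical data: 20 queries with increasing timestamps.
query_times = {f"q{i:02d}": i for i in range(20)}
train, val, test = time_based_split(query_times)
print(len(train), len(val), len(test))
print(sample_weight(position=3, gmv=0.0))
```

Splitting by time rather than at random keeps the validation and test folds strictly newer than the training fold, which mirrors how the model is used after it is saved.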

Weekly retrain cron + frozen-set eval

Two scheduled jobs feed the pipeline above. Pause / run-now from Admin → Schedules.

  weekly-ltr-retrain
    Mon 02:00 UTC · save-only, no auto-promote
  daily-frozen-eval
    03:00 UTC · scores every saved model on a stable slice
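
Scoring every saved model on the same stable slice makes their NDCG numbers directly comparable across versions. A minimal sketch of NDCG@k is below, assuming graded relevance labels and the exponential gain 2^rel - 1 (a common choice). The model names and labels in the example are made up, and the real eval job's exact metric settings are not stated here.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the first k graded labels."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 2)
        for i, rel in enumerate(labels[:k])
    )


def ndcg_at_k(labels_in_ranked_order, k):
    """NDCG@k for one query; labels are listed in the model's ranked order."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(labels_in_ranked_order, k) / ideal


def mean_ndcg(per_query_labels, k):
    """Average NDCG@k across queries."""
    return sum(ndcg_at_k(lbls, k) for lbls in per_query_labels) / len(per_query_labels)


# Hypothetical frozen slice: two queries, labels in each model's ranked order.
frozen_slice = {
    "v17": [[3, 2, 0, 1], [0, 1, 2]],
    "v18": [[3, 2, 1, 0], [2, 1, 0]],
}
scores = {name: mean_ndcg(lists, k=5) for name, lists in frozen_slice.items()}
print(scores)
```

Because the retrain job is save-only, a comparison like this informs the operator's activation decision rather than promoting a model automatically.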

BGE-M3 Fine-tune Pipeline

High-stakes, manual, ~quarterly. Each step is gated by human review; never auto-run end-to-end. Trigger steps from Admin → Fine-Tuning.

  Decision to fine-tune
    evidence-driven · ~quarterly
    ↓ POST /api/admin/finetune/export
  1. Mine triplets from CH+ES
    (query, positive, hard-negatives)
    ↓ triplets_YYYYMMDD.jsonl
  2. Validate triplets
    count · language balance · dedup
    ↓ human gate · proceed?
  3. Fine-tune BGE-M3
    train_bge_m3.sh · GPU · hours
    ↓ /app/model-cache/finetune/<name>/
  4. Smoke test (offline)
    eval_finetuned.py · baseline vs new
    ↓ human gate · ship?
  5. Activate (hot-swap embedder)
    POST /finetune/activate · embedder.load()
    ↓ Warning: ES vectors are now from the OLD model
  6. Full reindex (~87K docs)
    POST /reindex · 30–60 min on RTX 4000
    ↓ vectors aligned with new model
  7. Verify post-deploy
    live smoke + Grafana watch
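
Step 2 checks count, language balance, and duplicates before a human decides whether to proceed. A rough sketch of such a check is below. The JSONL field names (query, positive, negatives), the dedup key, and the Arabic-script heuristic are assumptions for illustration; the actual schema of triplets_YYYYMMDD.jsonl is not given here.

```python
import json
from collections import Counter


def is_arabic(text):
    """True if any character falls in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def validate_triplets(jsonl_lines):
    """Summarize a triplet file: row count, duplicates, language balance."""
    seen, kept = set(), []
    for line in jsonl_lines:
        row = json.loads(line)
        # Hypothetical dedup key: normalized query text plus positive ID.
        key = (row["query"].strip().lower(), row["positive"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    langs = Counter("ar" if is_arabic(r["query"]) else "other" for r in kept)
    return {
        "total": len(jsonl_lines),
        "unique": len(kept),
        "duplicates": len(jsonl_lines) - len(kept),
        "languages": dict(langs),
    }


# Hypothetical rows; the second duplicates the first after normalization.
rows = [
    {"query": "red lipstick", "positive": "sku-1", "negatives": ["sku-9"]},
    {"query": "Red Lipstick ", "positive": "sku-1", "negatives": ["sku-8"]},
    {"query": "زيت الجسم", "positive": "sku-2", "negatives": ["sku-7"]},
]
report = validate_triplets([json.dumps(r) for r in rows])
print(report)
```

The report is meant as input to the human gate, so it only summarizes and never filters the file on its own authority.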

Rollback procedure

If post-deploy verification surfaces issues, rollback is two steps:

  1. POST /api/admin/finetune/activate { model_path: "<previous>" }: hot-swap back to the previous embedder
  2. POST /reindex: rebuild ES vectors with the previous model (~30–60 min)

Search quality is degraded between (1) and (2), because queries are embedded with the restored model while the ES vectors still come from the model being rolled back. Rollback is faster if the previous model directory is still on disk.
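
The ordering of the two rollback calls matters, so a script that runs them should stop if the hot-swap fails rather than reindex with the wrong embedder. A hedged sketch follows. The post callable is injected so the example runs without a network; the empty /reindex payload and the use of HTTP status codes to detect failure are assumptions.

```python
def rollback(post, previous_model_path):
    """Run the two rollback steps in order; stop if a step fails.

    post: callable(path, payload) -> HTTP status code. Injected so this
          sketch runs offline; a real HTTP client is an assumption.
    """
    steps = [
        ("/api/admin/finetune/activate", {"model_path": previous_model_path}),
        ("/reindex", {}),  # payload shape is assumed, not documented here
    ]
    done = []
    for path, payload in steps:
        status = post(path, payload)
        if status >= 400:
            raise RuntimeError(f"{path} failed with HTTP {status}; completed: {done}")
        done.append(path)
    return done


# Stand-in for a real client: records calls and always succeeds.
calls = []


def fake_post(path, payload):
    calls.append((path, payload))
    return 200


completed = rollback(fake_post, "<previous>")
print(completed)
```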

Visual Search Pipeline (SigLIP 2)

SigLIP 2, an Apache-2.0 multilingual vision-language model, adds a visual lane to retrieval. The image encoder runs at index time (~50 min for the one-shot full pass, ~2 min/week for deltas); the text encoder runs at query time (~80 ms p50 on GPU). Both encoders share a 1152-dim embedding space, so text queries KNN-match against product image vectors with no extra translation step. The feature is kill-switched off by default until rollout.
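
Because text and image vectors live in one space and are L2-normalized, cosine similarity reduces to a plain dot product. The toy sketch below shows the idea with 3-dimensional vectors standing in for the 1152-dimensional ones; the vector values and file names are made up, and the real nearest-neighbour search runs inside Elasticsearch rather than in Python.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical text-query vector and two product image vectors.
text_vec = l2_normalize([0.2, 0.9, 0.1])
image_vecs = {
    "lipstick.jpg": l2_normalize([0.25, 0.85, 0.05]),
    "body_oil.jpg": l2_normalize([0.9, 0.1, 0.3]),
}

# Rank images by similarity to the text query, best first.
ranked = sorted(
    image_vecs,
    key=lambda name: dot(text_vec, image_vecs[name]),
    reverse=True,
)
print(ranked)
```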

Index-time (batch)

  PHP: index:product-images
    --missing-only (default) or --all · cron weekly
    ↓ scroll ES products → batches of 64 by image URL
  POST /image-embed/batch
    ML service · async URL fetch + GPU forward pass
    ↓ parallel httpx fetch · ~30 img/sec on RTX 4000
  SigLIP 2 image encoder
    google/siglip2-large-patch16-384 · ~2.5 GB VRAM · L2-normalized 1152-d
    ↓ PHP bulk_update partial doc
  ES products[image_vector]
    dense_vector dims=1152 similarity=cosine

Note: Layer-2 vector preservation (already in RollingIndexManager) also copies image_vector from the old index to the new one before alias swaps when SEARCH_IMAGE_VECTORS_ENABLED=true, so the May-10-style erosion can't happen to images either.
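
The index-time lane comes down to three pieces: a dense_vector field, batches of 64 image URLs, and partial-document bulk updates. The sketch below illustrates each in Python, although the production job is a PHP command. The helper names, example URLs, and the "index": True setting are assumptions; only dims=1152 and similarity=cosine come from the text above.

```python
import json

# Field mapping for the image vector (Elasticsearch dense_vector type).
IMAGE_VECTOR_MAPPING = {
    "properties": {
        "image_vector": {
            "type": "dense_vector",
            "dims": 1152,
            "index": True,          # assumed, so the field is KNN-searchable
            "similarity": "cosine",
        }
    }
}


def batches(items, size=64):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_update_lines(index, vectors_by_id):
    """Build an NDJSON body of partial-document updates for the bulk API."""
    lines = []
    for doc_id, vector in vectors_by_id.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": {"image_vector": vector}}))
    return "\n".join(lines) + "\n"


# Hypothetical image URLs for 150 products.
urls = [f"https://example.com/img/{i}.jpg" for i in range(150)]
sizes = [len(batch) for batch in batches(urls)]
body = bulk_update_lines("products", {"42": [0.0] * 1152})
print(sizes)
```

A partial update touches only image_vector, so the other fields of each product document are left as they are.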

Query-time (per request)

  User query (Arabic or English)
    e.g. "زيت الجسم" (body oil) / "red lipstick"
    ↓ PHP HybridStrategyRouter (Phase 2)
  POST /image-embed/text
    ML service · text → vector in image space
    ↓ SigLIP 2 TEXT encoder · ~80 ms p50
  Shared embedding space
    same 1152-d output as images · L2-normalized
    ↓ ES KNN against image_vector field
  Top-K product candidates by visual sim
    fused with BM25 + text-vector via RRF
    ↓ f_image_text_similarity as LTR feature
  LightGBM LTR rerank (v18+)
    visual sim joins 30+ other features

Phase 2: HybridStrategyRouter wires the image-vector KNN as a 4th retriever alongside BM25 + text-vector + sparse-vector. The text→image lookup adds ~80 ms p50 to query latency; it can be made optional per query via a flag.

Why SigLIP 2 (not jina-clip-v2)?

  • License: Apache 2.0 (free commercial) vs jina-clip-v2's CC-BY-NC-4.0 (requires paid license, ~$10k–50k/yr at our scale).
  • No trust_remote_code: standard SiglipModel in transformers, no custom Python downloaded from the HF Hub at load time.
  • Smaller: ~2.5 GB VRAM vs ~3 GB for jina-clip-v2.
  • Multilingual coverage: 100+ languages incl. Arabic vs 89 for jina-clip-v2.
  • Provenance: Google research, well-documented training data and benchmarks.