On-Device ML vs Server-Side Targeting: Latency, Privacy, and Performance Compared
The dominant model in behavioral targeting sends raw event data to a server, runs inference centrally, and returns a targeting decision. It has been the default for 15 years. It works well enough when latency is measured in minutes and privacy is measured in checkbox compliance.
Neither of those conditions holds anymore. High-intent behavioral windows close in seconds, not minutes. Regulatory exposure from centralizing raw behavioral events is escalating. A targeting architecture optimized for 2010's constraints is increasingly a liability in 2025.
This post compares two architectures at the engineering level: server-side inference, where raw events travel to a central processing system, and on-device inference, where the model runs locally and only anonymized scores are transmitted. We cover the latency profile of each, the privacy surface area each creates, and the practical performance outcomes.
1. Two Architectures, Two Philosophies
The difference between server-side and on-device ML targeting is not a configuration option. It is a fundamental architectural decision about where inference happens, and consequently, where data travels.
Server-side targeting operates on the premise that compute is cheap and centralized inference is the simplest architecture. Every behavioral event — a page view, a click, a scroll depth measurement, a session idle timeout — is transmitted to a server as it occurs. The server maintains the user's behavioral state, runs the ML model when a trigger condition is met, and returns a targeting decision. The user's device is a data source, not a compute resource.
On-device targeting inverts this. The ML model runs as a local artifact on the user's device — the same hardware that generated the behavioral events. Feature extraction, embedding computation, and intent inference all happen locally. The device does not ask a server "should I show this user a campaign?" — it answers that question itself, then transmits the result as a compact score rather than transmitting the raw question data.
The philosophical divide is about trust and architecture. Server-side targeting implicitly trusts that centralized processing is acceptable. On-device targeting treats local inference as the default because raw behavioral data should not leave the device that generated it.
2. Server-Side Targeting: The Latency Problem
Every hop in a server-side targeting pipeline adds latency. When you enumerate the hops, the cumulative total is more than most product teams expect.
A standard server-side pipeline for a behavioral trigger looks like this:
| Stage | Description | Latency Range |
|---|---|---|
| Event emission | SDK captures event, serializes payload | 0–2ms |
| Network transit (client → server) | TCP + TLS handshake on 4G/LTE | 40–150ms |
| Server queue wait | Load balancer + ingestion service queue | 5–50ms |
| Event persistence | Write to Kafka or database | 10–30ms |
| Feature computation | Aggregate new event with user history | 20–80ms |
| Model inference | Forward pass through scoring model | 20–100ms |
| Response generation | Serialize decision, route to activation | 5–20ms |
| Response transit (server → client) | Network return trip | 40–150ms |
| Total (optimistic) | all stages near their lower bounds | ~140–300ms |
| Total (typical) | mid-range to upper-bound stages | ~300–582ms |
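The totals in the table can be sanity-checked by summing the per-stage bounds. A quick Python sketch (stage names are shorthand for the rows above):

```python
# Per-stage latency ranges (ms) from the server-side pipeline table.
STAGES = {
    "event_emission":       (0, 2),
    "network_transit_up":   (40, 150),
    "server_queue_wait":    (5, 50),
    "event_persistence":    (10, 30),
    "feature_computation":  (20, 80),
    "model_inference":      (20, 100),
    "response_generation":  (5, 20),
    "network_transit_down": (40, 150),
}

def latency_budget(stages):
    """Return (best_case_ms, worst_case_ms) summed across all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

print(latency_budget(STAGES))  # (140, 582)
```

Note that the optimistic floor of ~140ms already exceeds a 100ms interaction budget before any cold start or batch lag is added.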
And that is the optimistic case — an already-warm inference server, a low-traffic moment, and a user on a fast network. In practice, three additional factors inflate this budget substantially:
Cold starts. Cloud-based inference servers scale to zero during off-peak hours. When the first events of a morning session arrive, the inference container may require 1–3 seconds to start. By the time the first targeting decision is ready, the user has already moved past the trigger moment.
Batch scheduling lag in CDPs. Most commercial CDPs do not run inference on every event in real time. They accumulate events, run a batch scoring job on a schedule (every 15 minutes, every hour, or nightly), and update segment membership after the batch completes. The stated "real-time" capability of many CDPs means "updated within 15 minutes," not "updated in response to this event." End-to-end latency from event to segment update: 15 minutes to 24 hours.
Network jitter on mobile. Mobile connections are not stable. 4G jitter adds variance of 20–100ms per hop. A session that starts on WiFi, transitions to LTE, and passes through a tunnel while browsing may experience 400ms+ network latency for a single round-trip at the critical trigger moment.
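The jitter effect compounds across hops. A toy Monte Carlo (illustrative numbers only, reusing the 20–100ms-per-hop jitter range above) shows why tail latency, not median latency, is what eats the trigger window:

```python
import random

def round_trip_ms(base_ms=40.0, jitter=(20.0, 100.0), hops=2):
    # Each direction pays a base transit cost plus a uniformly drawn
    # jitter penalty; two hops model one client-server round trip.
    return sum(base_ms + random.uniform(*jitter) for _ in range(hops))

random.seed(7)
samples = sorted(round_trip_ms() for _ in range(10_000))
p50 = samples[len(samples) // 2]
p95 = samples[int(len(samples) * 0.95)]
# The p50-to-p95 spread alone can consume most of a 100-300ms window.
print(f"p50={p50:.0f}ms p95={p95:.0f}ms")
```

The exact percentiles depend on the assumed distribution; the point is the spread, which no amount of median-latency optimization removes.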
The result: a targeting system that is nominally "real-time" but structurally incapable of acting within the 100–300ms window in which high-intent behavioral signals carry peak value.
3. Server-Side Targeting: The Privacy Problem
Latency is an engineering problem. The privacy problem created by server-side targeting is increasingly a legal one.
When every behavioral event is transmitted to a server, the server receives a continuous stream of data that is PII-adjacent even when it does not contain explicit personal identifiers. A session log containing:
```
user_device_id=3a8f21c4
timestamp=2025-04-10T14:22:07Z action=page_view url=/products/trail-shoes-x3 session_depth=3
timestamp=2025-04-10T14:22:19Z action=scroll depth_pct=78 dwell_ms=12400
timestamp=2025-04-10T14:22:31Z action=add_to_cart product_id=TSX3 qty=1 price=118.00
timestamp=2025-04-10T14:22:44Z action=page_view url=/checkout
timestamp=2025-04-10T14:23:02Z action=exit referrer=null
```
...contains behavioral patterns (browsing style, decision latency, price sensitivity), a device fingerprint, a precise timestamp sequence, and a session path that is potentially linkable to a real identity when combined with network metadata. Under GDPR, this data stream constitutes personal data if the device ID is persistent and can be linked to a natural person — which it almost always can.
Every hop this data traverses creates compliance surface area:
- At rest in the ingestion buffer: requires encryption and access controls
- In the stream processor: requires a DPA with the processing infrastructure provider
- In the feature store: requires data retention schedules and deletion capability
- In the ML training pipeline: raw events used for model training create a secondary data store subject to the same subject rights as the original
- At the ad network: third-party recipients of behavioral data become joint controllers or processors, each requiring a scoped DPA
The aggregate compliance burden of a server-side pipeline processing raw behavioral events at scale is not a legal overlay — it is an ongoing operational cost that grows with data volume, pipeline complexity, and regulatory scope.
4. On-Device ML: How It Works
MicroTarget's on-device architecture moves the inference boundary to the SDK running on the user's device. The on-device pipeline is as follows:
```
[Behavioral Events]
        |
        v
[Local WAL Buffer]         -- write-ahead log, persists across app lifecycle
        |
        v
[Feature Extraction]       -- compute session features: dwell, depth, velocity, etc.
        |
        v
[Embedding Layer]          -- dense vector representations, <15ms compute
        |
        v
[Intent Model Inference]   -- outputs: engagement_intent, purchase_intent, churn_risk
        |
        v
[Local Trigger Evaluation] -- compare scores against campaign thresholds
        |
        +-- trigger fires -----> 3 float scores transmitted to server
        |
        +-- no trigger --------> nothing transmitted
```
The write-ahead log (WAL) buffer ensures that events generated while the device is offline — airplane mode, poor connectivity, backgrounded app — are not lost. When the session resumes, the buffer replays events into the feature extraction pipeline.
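A minimal version of such a buffer can be sketched in a few lines (Python for illustration; `EventWAL` and its API are hypothetical, not the actual SDK):

```python
import json
import os
import tempfile

class EventWAL:
    """Append-only event log that survives app restarts and offline periods.
    (Hypothetical class, not the actual SDK API.)"""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # One JSON object per line; the file is the durable source of truth.
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self):
        # Yield buffered events in arrival order once the session resumes.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

# Buffer while offline, replay into feature extraction on resume.
wal = EventWAL(os.path.join(tempfile.mkdtemp(), "session_wal.log"))
wal.append({"action": "page_view", "url": "/products/trail-shoes-x3"})
wal.append({"action": "add_to_cart", "product_id": "TSX3"})
replayed = list(wal.replay())
```

The append-only, one-record-per-line layout is what makes the buffer cheap to write on every event and trivial to replay in order.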
Feature extraction computes a set of behavioral features from the raw event stream locally: session depth, scroll velocity, product view recency, cart state, time-in-session, interaction density, idle gap duration, and checkout funnel position. These features are intermediate representations — they describe the session state, not the raw events.
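A sketch of that reduction, assuming a simple event-dict format (field names like `t` and `depth_pct` are illustrative):

```python
def extract_features(events):
    """Reduce a raw event stream to session-state features. The raw events
    never need to leave this function, let alone the device."""
    ts = [e["t"] for e in events]
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return {
        "session_depth": sum(e["action"] == "page_view" for e in events),
        "cart_items": sum(e["action"] == "add_to_cart" for e in events),
        "time_in_session_s": max(ts) - min(ts) if ts else 0,
        "max_scroll_pct": max((e.get("depth_pct", 0) for e in events
                               if e["action"] == "scroll"), default=0),
        "idle_gap_max_s": max(gaps, default=0),
    }

# Toy stream, timestamps in seconds from session start:
stream = [
    {"t": 0,  "action": "page_view"},
    {"t": 12, "action": "scroll", "depth_pct": 78},
    {"t": 24, "action": "add_to_cart"},
    {"t": 37, "action": "page_view"},
]
features = extract_features(stream)
print(features["session_depth"], features["idle_gap_max_s"])  # 2 13
```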
The embedding layer maps the feature vector into a dense representation optimized for the intent model's input space. Compute time on a modern mobile SoC: under 15ms. This layer is updated with each new event, maintaining a rolling embedded representation of the current session state.
The intent model runs inference on the embedded features and outputs three scores:
- `engagement_intent` ∈ [0,1] — probability of continued active engagement
- `purchase_intent` ∈ [0,1] — probability of completing a transaction in the current session
- `churn_risk` ∈ [0,1] — probability of disengagement within the next 7 days
These scores are the only data transmitted to the server. The raw events, the intermediate features, the embedding vectors — all remain on the device. A controller receiving three floating-point numbers has received no personal data, no behavioral history, and no PII-adjacent signals. They have received an anonymized behavioral state representation.
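The trigger-then-transmit step can be sketched as follows (thresholds and the payload envelope are illustrative, not the actual wire format):

```python
import json
import struct

# Illustrative campaign thresholds; real thresholds come from campaign config.
THRESHOLDS = {"engagement_intent": 0.6, "purchase_intent": 0.5, "churn_risk": 0.7}

def evaluate_trigger(scores):
    """If no threshold is met, nothing leaves the device at all; otherwise
    only the three float scores (plus a tiny envelope) are serialized."""
    if not any(scores[k] >= t for k, t in THRESHOLDS.items()):
        return None
    packed = struct.pack("<3f", scores["engagement_intent"],
                         scores["purchase_intent"], scores["churn_risk"])
    # 12 bytes of scores; with metadata the payload stays around 100 bytes.
    return json.dumps({"v": 1, "scores": packed.hex()}).encode()

payload = evaluate_trigger(
    {"engagement_intent": 0.81, "purchase_intent": 0.74, "churn_risk": 0.12})
quiet = evaluate_trigger(
    {"engagement_intent": 0.10, "purchase_intent": 0.05, "churn_risk": 0.02})
# quiet is None: a below-threshold session produces zero network traffic.
```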
5. The 150ms Pipeline Explained
Once the on-device inference has produced scores and transmitted them to the server, the server-side portion of the pipeline completes the targeting decision and dispatches the activation. The full end-to-end budget from triggering event to campaign dispatch:
| Stage | Component | Cumulative Latency |
|---|---|---|
| Event captured on device | SDK local processing | 0ms |
| Local feature extraction | On-device compute | +5–10ms |
| Embedding computation | On-device, <15ms | +15ms |
| Intent inference on device | Local model forward pass | +20–35ms |
| Score transmission to server | Network (scores only, ~100 bytes) | +15–40ms |
| Kafka ingestion | Score event → topic | +5–10ms |
| Stream processor lookup | Feature Store read (RocksDB, P99 ~1ms) | +2–5ms |
| Ranking engine (GSP) | Campaign selection, <50ms budget | +10–50ms |
| Trigger dispatch | Push/email/webhook routing | +5–15ms |
| Total end-to-end | | ~77–165ms |
The score payload transmitted over the network is approximately 100–200 bytes (three floats plus metadata). A full raw event payload, by contrast, may run 1–5KB per event, with dozens of events per session. Because the score payload is 20–50x smaller and fits comfortably in a single packet, network transmission is roughly 5–10x faster.
The Generalized Second Price (GSP) ranking engine selects the optimal campaign variant from the candidate set using the incoming intent scores as input features. The GSP algorithm produces monotonic and deterministic decisions: given the same input scores and the same candidate campaigns, it will always produce the same output. This determinism is not just an engineering convenience — it is an audit property. Every campaign selection can be replayed from its input scores.
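A simplified, deterministic GSP-style selection can be sketched like this (campaign names, bids, and the value function are illustrative; the production engine's scoring is richer):

```python
def rank_campaigns(candidates, intent_score):
    """GSP-style selection sketch: rank candidates by bid * intent score and
    charge the winner the runner-up's effective value. The explicit tie-break
    on campaign id keeps the decision deterministic and replayable."""
    scored = sorted(
        ((bid * intent_score, cid) for cid, bid in candidates),
        key=lambda x: (-x[0], x[1]),  # value desc, then id for determinism
    )
    winner_value, winner_id = scored[0]
    clearing_price = scored[1][0] if len(scored) > 1 else winner_value
    return winner_id, clearing_price

candidates = [("cart_recovery", 1.20), ("upsell", 0.80), ("welcome", 0.40)]
winner, price = rank_campaigns(candidates, intent_score=0.74)
print(winner, round(price, 3))  # cart_recovery 0.592
```

The replay property falls out directly: the function is pure, so rerunning it with logged input scores reproduces the logged decision.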
The Feature Store lookup during ranking uses RocksDB as the backing KV store, with P99 read latency of approximately 1ms on local storage. The store contains accumulated account-level features (historical purchase frequency, lifetime value tier, subscription status) that complement the session-level scores from on-device inference. Together, session state plus account context gives the ranking engine a complete behavioral picture with sub-millisecond feature retrieval.
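The merge at ranking time is conceptually simple. In this sketch a dict stands in for the RocksDB-backed store, and the account id and field names are made up:

```python
# Stand-in for the Feature Store's KV backend (RocksDB in production,
# P99 read ~1ms); keys and fields here are illustrative.
ACCOUNT_FEATURES = {
    "acct_9f2": {"purchase_freq_90d": 4, "ltv_tier": "gold", "subscribed": True},
}

def ranking_input(account_id, session_scores):
    """Join account-level context with the session-level scores that just
    arrived from on-device inference."""
    account = ACCOUNT_FEATURES.get(account_id, {})
    return {**account, **session_scores}

row = ranking_input("acct_9f2", {"purchase_intent": 0.74, "churn_risk": 0.12})
print(row["ltv_tier"], row["purchase_intent"])  # gold 0.74
```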
6. Multi-Modal Signal Fusion: What Gets Processed On-Device
The intent model does not run on a single behavioral signal. It fuses eight signal types into a graph-style behavioral state representation before inference. The eight modalities are:
- Navigation depth — sequence and depth of pages/screens visited
- Interaction velocity — rate of clicks, taps, and scroll events per unit time
- Dwell time distribution — time spent per content element, normalized by content type
- Cart state dynamics — add/remove/view cycles, time from first view to cart action
- Session recency and frequency — recency of prior sessions, session cadence over 7/30 days
- Funnel position — checkout funnel stage, drop-off history
- Content affinity — category and product cluster signals from view history
- Idle and exit patterns — idle gap distribution, background/foreground transitions
These eight signals are individually weak predictors of intent. Each one, in isolation, produces noisy estimates with high variance. A user with a long dwell time might be reading carefully or have left their device open. A user with high interaction velocity might be excited or frustrated.
The power of multi-modal fusion is that signals correlate and contradict in patterns that are diagnostically meaningful. High dwell time combined with high interaction velocity and a direct path to checkout is a very different behavioral signature from high dwell time combined with low interaction velocity and back-navigation. The fused representation captures these cross-signal patterns.
Combining all eight signal types produces estimates that are 4–7x more stable than single-modality systems, measured by coefficient of variation in purchase-intent scores across users with identical purchase outcomes. The reduction in variance is the key metric: more stable scores mean fewer false positive trigger firings, better campaign precision, and higher conversion rates per impression.
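The stabilizing effect can be illustrated with a toy model using independent noise: averaging eight noisy estimators cuts the coefficient of variation by roughly √8 ≈ 2.8x. The 4–7x figure above additionally exploits cross-signal structure that this sketch deliberately ignores:

```python
import random
import statistics

def coeff_var(xs):
    return statistics.stdev(xs) / statistics.fmean(xs)

random.seed(42)
TRUE_INTENT = 0.6   # the "true" intent shared by all simulated users
NOISE_SD = 0.15     # per-modality measurement noise (illustrative)

# Score 2000 users with one noisy modality, then with a naive average of
# eight independent noisy modalities, all centered on the same true intent.
single = [TRUE_INTENT + random.gauss(0, NOISE_SD) for _ in range(2000)]
fused = [statistics.fmean(TRUE_INTENT + random.gauss(0, NOISE_SD)
                          for _ in range(8)) for _ in range(2000)]

improvement = coeff_var(single) / coeff_var(fused)
# ~sqrt(8) = 2.8x for i.i.d. noise; correlated cross-signal patterns are
# what a real fusion model adds on top of this baseline effect.
```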
The graph-style fusion architecture treats signals as nodes in a behavioral graph, with edges representing temporal and causal relationships between events. This structure allows the model to reason about sequences ("viewed product, left, returned 20 minutes later, added to cart") rather than treating events as independent observations.
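Structurally, such a session graph can be sketched as nodes plus temporal edges (a toy representation; the production edge semantics are richer):

```python
def build_event_graph(events):
    """Events become nodes; consecutive events are joined by edges carrying
    the time gap between them, so long absences are explicit in the graph."""
    nodes = [{"id": i, "action": e["action"]} for i, e in enumerate(events)]
    edges = [
        {"src": i, "dst": i + 1, "gap_s": events[i + 1]["t"] - events[i]["t"]}
        for i in range(len(events) - 1)
    ]
    return nodes, edges

# "Viewed product, left, returned 20 minutes later, added to cart":
session = [
    {"t": 0,    "action": "page_view"},
    {"t": 30,   "action": "exit"},
    {"t": 1230, "action": "page_view"},
    {"t": 1260, "action": "add_to_cart"},
]
nodes, edges = build_event_graph(session)
# The 1200s gap edge makes the leave-and-return pattern explicit rather
# than leaving it implied by raw timestamps.
```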
7. Performance Comparison Table
| Dimension | Server-Side CDP | On-Device ML (MicroTarget) |
|---|---|---|
| Event-to-decision latency | 300–582ms (real-time mode) / 15min–24h (batch mode) | 77–165ms end-to-end |
| Cold start sensitivity | High — inference server scale-to-zero adds 1–3s | None — model runs on device, always available |
| Batch dependency | Required for segment updates in most CDPs | None — inference is continuous, per-session |
| Data warehouse requirement | Yes — central store required for feature computation | No — features computed locally, only scores transmitted |
| Personal data transmitted | Full raw event stream (KB per event, PII-adjacent) | 3 float scores per trigger (~100 bytes, non-personal) |
| GDPR compliance surface | High — raw events at rest, in transit, and with processors | Minimal — no personal data transmitted for inference |
| Data subject deletion scope | Multi-system (pipeline, warehouse, feature store, ML training data, ad network) | Account-level records only; on-device data deleted locally |
| Signal stability (CV) | High variance — single or dual modality common | 4–7x lower variance via 8-signal fusion |
| Explainability (GDPR Art. 22) | Typically a black-box score | Concept Bottleneck: every decision has an auditable reason |
| Offline resilience | None — requires connectivity for inference | Full — WAL buffer replays events when connectivity resumes |
| Third-party processor exposure | Yes — raw events reach ad network processors | No — scores only; no behavioral history transmitted |
| Infrastructure required at customer | Full pipeline (ingestion, processing, warehouse, serving) | SDK only (server-side ranking is managed infrastructure) |
The Architectural Trade-Off
Server-side inference is simpler to build from scratch. All compute is centralized. The feature engineering, model training, and inference pipeline live in one place. You do not need to distribute model artifacts to client devices or manage model versioning across a heterogeneous device fleet.
The cost of that simplicity is paid in latency, privacy surface area, and compliance burden. For organizations targeting low-intent, high-frequency interactions where a 15-minute lag is acceptable and compliance is a second-order concern, server-side CDPs remain a practical choice.
For organizations targeting high-intent behavioral windows — cart abandonment, conversion moment, churn risk — where the targeting decision is only valuable within a 100–300ms window, on-device inference is not a marginal improvement. It is a structural requirement. The server-side pipeline cannot physically complete within the window, regardless of infrastructure investment.
The privacy advantage is equally structural. Moving inference to the device does not require policy changes, DPA amendments, or consent redesigns to reduce exposure — it changes the architecture so that the exposure does not exist.
See It in Action
MicroTarget's interactive simulation demonstrates the latency and conversion impact of real-time on-device targeting compared to batch CDP baselines using your own revenue parameters.
Run the cost-of-delay simulation to see how the difference between 150ms and 15-minute targeting latency translates into revenue for your specific traffic volume and conversion rate.