
On-Device ML vs Server-Side Targeting: Latency, Privacy, and Performance Compared

Server-side targeting has a privacy problem and a latency problem. On-device ML solves both — but how does it actually work? A technical comparison of architectures, latency profiles, and conversion outcomes.

The dominant model in behavioral targeting sends raw event data to a server, runs inference centrally, and returns a targeting decision. It has been the default for 15 years. It works well enough when latency is measured in minutes and privacy is measured in checkbox compliance.

Neither of those conditions holds anymore. High-intent behavioral windows close in seconds, not minutes. Regulatory exposure from centralizing raw behavioral events is escalating. A targeting architecture optimized for 2010's constraints is increasingly a liability in 2025.

This post compares two architectures at the engineering level: server-side inference, where raw events travel to a central processing system, and on-device inference, where the model runs locally and only anonymized scores are transmitted. We cover the latency profile of each, the privacy surface area each creates, and the practical performance outcomes.


1. Two Architectures, Two Philosophies

The difference between server-side and on-device ML targeting is not a configuration option. It is a fundamental architectural decision about where inference happens, and consequently, where data travels.

Server-side targeting operates on the premise that compute is cheap and centralized inference is the simplest architecture. Every behavioral event — a page view, a click, a scroll depth measurement, a session idle timeout — is transmitted to a server as it occurs. The server maintains the user's behavioral state, runs the ML model when a trigger condition is met, and returns a targeting decision. The user's device is a data source, not a compute resource.

On-device targeting inverts this. The ML model runs as a local artifact on the user's device — the same hardware that generated the behavioral events. Feature extraction, embedding computation, and intent inference all happen locally. The device does not ask a server "should I show this user a campaign?" — it answers that question itself, then transmits the result as a compact score rather than transmitting the raw question data.

The philosophical divide is about trust and architecture. Server-side targeting implicitly trusts that centralized processing is acceptable. On-device targeting treats local inference as the default because raw behavioral data should not leave the device that generated it.


2. Server-Side Targeting: The Latency Problem

Every hop in a server-side targeting pipeline adds latency. When you enumerate the hops, the cumulative total is more than most product teams expect.

A standard server-side pipeline for a behavioral trigger looks like this:

| Stage | Description | Latency Range |
|---|---|---|
| Event emission | SDK captures event, serializes payload | 0–2ms |
| Network transit (client → server) | TCP + TLS handshake on 4G/LTE | 40–150ms |
| Server queue wait | Load balancer + ingestion service queue | 5–50ms |
| Event persistence | Write to Kafka or database | 10–30ms |
| Feature computation | Aggregate new event with user history | 20–80ms |
| Model inference | Forward pass through scoring model | 20–100ms |
| Response generation | Serialize decision, route to activation | 5–20ms |
| Response transit (server → client) | Network return trip | 40–150ms |
| Total (optimistic) | | ~140–300ms |
| Total (typical) | | ~300–582ms |
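The totals follow directly from summing the per-stage ranges. A quick sketch (stage names and figures taken from the table above):

```python
# Per-stage latency ranges (ms) from the server-side pipeline table.
STAGES = {
    "event_emission": (0, 2),
    "network_transit_up": (40, 150),
    "server_queue_wait": (5, 50),
    "event_persistence": (10, 30),
    "feature_computation": (20, 80),
    "model_inference": (20, 100),
    "response_generation": (5, 20),
    "network_transit_down": (40, 150),
}

best = sum(lo for lo, _ in STAGES.values())   # every stage at its minimum
worst = sum(hi for _, hi in STAGES.values())  # every stage at its maximum

print(f"best case:  {best}ms")   # 140ms
print(f"worst case: {worst}ms")  # 582ms
```

Even with every stage at its minimum, the round trip alone (80ms of network transit) consumes most of the high-intent window before any inference happens.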

And that is the optimistic case — an already-warm inference server, a low-traffic moment, and a user on a fast network. In practice, three additional factors inflate this budget substantially:

Cold starts. Cloud-based inference servers scale to zero during off-peak hours. When the first events of a morning session arrive, the inference container may require 1–3 seconds to start. By the time the first targeting decision is ready, the user has already moved past the trigger moment.

Batch scheduling lag in CDPs. Most commercial CDPs do not run inference on every event in real time. They accumulate events, run a batch scoring job on a schedule (every 15 minutes, every hour, or nightly), and update segment membership after the batch completes. The stated "real-time" capability of many CDPs means "updated within 15 minutes," not "updated in response to this event." End-to-end latency from event to segment update: 15 minutes to 24 hours.

Network jitter on mobile. Mobile connections are not stable. 4G jitter adds variance of 20–100ms per hop. A session that starts on WiFi, transitions to LTE, and passes through a tunnel while browsing may experience 400ms+ network latency for a single round-trip at the critical trigger moment.

The result: a targeting system that is nominally "real-time" but structurally incapable of acting within the 100–300ms window in which high-intent behavioral signals carry peak value.


3. Server-Side Targeting: The Privacy Problem

Latency is an engineering problem. The privacy problem created by server-side targeting is increasingly a legal one.

When every behavioral event is transmitted to a server, the server receives a continuous stream of data that is PII-adjacent even when it does not contain explicit personal identifiers. A session log containing:

user_device_id=3a8f21c4
timestamp=2025-04-10T14:22:07Z  action=page_view  url=/products/trail-shoes-x3  session_depth=3
timestamp=2025-04-10T14:22:19Z  action=scroll     depth_pct=78  dwell_ms=12400
timestamp=2025-04-10T14:22:31Z  action=add_to_cart  product_id=TSX3  qty=1  price=118.00
timestamp=2025-04-10T14:22:44Z  action=page_view  url=/checkout  
timestamp=2025-04-10T14:23:02Z  action=exit       referrer=null

...contains behavioral patterns (browsing style, decision latency, price sensitivity), a device fingerprint, a precise timestamp sequence, and a session path that is potentially linkable to a real identity when combined with network metadata. Under GDPR, this data stream constitutes personal data if the device ID is persistent and can be linked to a natural person — which it almost always can.

Every hop this data traverses creates compliance surface area:

  • At-rest in the ingestion buffer: requires encryption and access controls
  • In the stream processor: requires DPA with the processing infrastructure provider
  • In the feature store: requires data retention schedules and deletion capability
  • In the ML training pipeline: raw events used for model training create a secondary data store subject to the same subject rights as the original
  • At the ad network: third-party recipients of behavioral data become joint controllers or processors, each requiring a scoped DPA

The aggregate compliance burden of a server-side pipeline processing raw behavioral events at scale is not a legal overlay — it is an ongoing operational cost that grows with data volume, pipeline complexity, and regulatory scope.


4. On-Device ML: How It Works

MicroTarget's on-device architecture moves the inference boundary to the SDK running on the user's device. The on-device pipeline looks like this:

[Behavioral Events]
       |
       v
[Local WAL Buffer]          -- Write-ahead log, persists across app lifecycle
       |
       v
[Feature Extraction]        -- Compute session features: dwell, depth, velocity, etc.
       |
       v
[Embedding Layer]           -- Dense vector representations, <15ms compute
       |
       v
[Intent Model Inference]    -- Outputs: engagement_intent, purchase_intent, churn_risk
       |
       v
[Local Trigger Evaluation]  -- Compare scores against campaign thresholds
       |                          ↓ (if trigger fires)
       |                    [3 float scores transmitted to server]
       v
[No transmission if no trigger]

The write-ahead log (WAL) buffer ensures that events generated while the device is offline — airplane mode, poor connectivity, backgrounded app — are not lost. When the session resumes, the buffer replays events into the feature extraction pipeline.
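A minimal sketch of the WAL buffer pattern (illustrative only, not the actual MicroTarget SDK): events are appended to durable local storage as they occur, then replayed in order when the session resumes.

```python
import json
import os
import tempfile

class WALBuffer:
    """Append-only local event log that survives offline periods."""

    def __init__(self, path):
        self.path = path

    def append(self, event: dict) -> None:
        # Append-only write; persists across app lifecycle and connectivity loss.
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self) -> list:
        # Return buffered events in order, then truncate the log.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            events = [json.loads(line) for line in f]
        os.remove(self.path)
        return events

wal = WALBuffer(os.path.join(tempfile.mkdtemp(), "events.wal"))
wal.append({"action": "page_view", "url": "/products/trail-shoes-x3"})
wal.append({"action": "add_to_cart", "product_id": "TSX3"})
print(len(wal.replay()))  # 2 — both offline events recovered in order
```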

Feature extraction computes a set of behavioral features from the raw event stream locally: session depth, scroll velocity, product view recency, cart state, time-in-session, interaction density, idle gap duration, and checkout funnel position. These features are intermediate representations — they describe the session state, not the raw events.
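The reduction from raw events to session-state features can be sketched as follows (event shape and feature names here are assumptions for illustration, not the SDK's actual schema):

```python
# Illustrative local feature extraction: raw events in, session state out.
def extract_features(events):
    ts = [e["t"] for e in events]                      # event timestamps (s)
    gaps = [b - a for a, b in zip(ts, ts[1:])]         # inter-event idle gaps
    return {
        "session_depth": sum(1 for e in events if e["action"] == "page_view"),
        "cart_items": sum(1 for e in events if e["action"] == "add_to_cart"),
        "time_in_session_s": ts[-1] - ts[0] if len(ts) > 1 else 0,
        "max_idle_gap_s": max(gaps, default=0),
    }

session = [
    {"t": 0,  "action": "page_view"},
    {"t": 12, "action": "scroll"},
    {"t": 24, "action": "add_to_cart"},
    {"t": 37, "action": "page_view"},
]
print(extract_features(session))
# {'session_depth': 2, 'cart_items': 1, 'time_in_session_s': 37, 'max_idle_gap_s': 13}
```

Note that the output describes the session (depth, cart state, idle pattern) without retaining the individual events that produced it.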

The embedding layer maps the feature vector into a dense representation optimized for the intent model's input space. Compute time on a modern mobile SoC: under 15ms. This layer is updated with each new event, maintaining a rolling embedded representation of the current session state.

The intent model runs inference on the embedded features and outputs three scores:

  • engagement_intent ∈ [0,1] — probability of continued active engagement
  • purchase_intent ∈ [0,1] — probability of completing a transaction in the current session
  • churn_risk ∈ [0,1] — probability of disengagement within the next 7 days

These scores are the only data transmitted to the server. The raw events, the intermediate features, the embedding vectors — all remain on the device. A controller receiving three floating-point numbers has received no personal data, no behavioral history, and no PII-adjacent signals. They have received an anonymized behavioral state representation.
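The wire payload is correspondingly tiny. A sketch with assumed field names (the actual payload schema is not specified here):

```python
import json

# Score-only payload: three floats plus minimal metadata.
# No raw events, no URLs, no behavioral history.
payload = {
    "v": 1,
    "engagement_intent": 0.82,
    "purchase_intent": 0.74,
    "churn_risk": 0.06,
}
wire = json.dumps(payload, separators=(",", ":")).encode()
print(len(wire), "bytes")  # well under the ~100-200 byte budget
```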


5. The 150ms Pipeline Explained

Once the on-device inference has produced scores and transmitted them to the server, the server-side portion of the pipeline completes the targeting decision and dispatches the activation. The full end-to-end budget from triggering event to campaign dispatch:

| Stage | Component | Cumulative Latency |
|---|---|---|
| Event captured on device | SDK local processing | 0ms |
| Local feature extraction | On-device compute | +5–10ms |
| Embedding computation | On-device, <15ms | +15ms |
| Intent inference on device | Local model forward pass | +20–35ms |
| Score transmission to server | Network (scores only, ~100 bytes) | +15–40ms |
| Kafka ingestion | Score event → topic | +5–10ms |
| Stream processor lookup | Feature Store read (RocksDB, P99 ~1ms) | +2–5ms |
| Ranking engine (GSP) | Campaign selection, <50ms budget | +10–50ms |
| Trigger dispatch | Push/email/webhook routing | +5–15ms |
| Total end-to-end | | ~77–165ms |

The score payload transmitted over the network is approximately 100–200 bytes (three floats plus metadata). Compare this to a full event payload that may be 1–5KB per event, with dozens of events per session. The network transmission time is 5–10x faster because the payload is 20–50x smaller.

The Generalized Second Price (GSP) ranking engine selects the optimal campaign variant from the candidate set using the incoming intent scores as input features. The GSP algorithm produces monotonic and deterministic decisions: given the same input scores and the same candidate campaigns, it will always produce the same output. This determinism is not just an engineering convenience — it is an audit property. Every campaign selection can be replayed from its input scores.
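A toy sketch of a deterministic GSP-style selection (illustrative, not the actual ranking engine; campaign names, bids, and quality weights are invented): candidates are ordered by effective bid with a stable tiebreak, so identical inputs always replay to the same winner and price.

```python
# Deterministic GSP-style campaign selection sketch.
def select_campaign(candidates, intent_score):
    ranked = sorted(
        candidates,
        # Rank by bid x quality x intent; stable tiebreak on id for determinism.
        key=lambda c: (-c["bid"] * c["quality"] * intent_score, c["id"]),
    )
    winner, runner_up = ranked[0], ranked[1]
    # Generalized second price: winner pays just enough to beat the runner-up.
    price = runner_up["bid"] * runner_up["quality"] / winner["quality"]
    return winner["id"], round(price, 4)

candidates = [
    {"id": "cart_recovery", "bid": 1.20, "quality": 0.9},
    {"id": "new_arrivals",  "bid": 1.50, "quality": 0.6},
    {"id": "loyalty_push",  "bid": 0.80, "quality": 0.8},
]
print(select_campaign(candidates, intent_score=0.74))
# ('cart_recovery', 1.0)
```

Because the sort key and tiebreak are pure functions of the inputs, any historical decision can be reproduced exactly from the logged scores and candidate set.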

The Feature Store lookup during ranking uses RocksDB as the backing KV store, with P99 read latency of approximately 1ms on local storage. The store contains accumulated account-level features (historical purchase frequency, lifetime value tier, subscription status) that complement the session-level scores from on-device inference. Together, session state plus account context gives the ranking engine a complete behavioral picture with sub-millisecond feature retrieval.


6. Multi-Modal Signal Fusion: What Gets Processed On-Device

The intent model does not run on a single behavioral signal. It fuses eight signal types into a graph-style behavioral state representation before inference. The eight modalities are:

  1. Navigation depth — sequence and depth of pages/screens visited
  2. Interaction velocity — rate of clicks, taps, and scroll events per unit time
  3. Dwell time distribution — time spent per content element, normalized by content type
  4. Cart state dynamics — add/remove/view cycles, time from first view to cart action
  5. Session recency and frequency — recency of prior sessions, session cadence over 7/30 days
  6. Funnel position — checkout funnel stage, drop-off history
  7. Content affinity — category and product cluster signals from view history
  8. Idle and exit patterns — idle gap distribution, background/foreground transitions

These eight signals are individually weak predictors of intent. Each one, in isolation, produces noisy estimates with high variance. A user with a long dwell time might be reading carefully or have left their device open. A user with high interaction velocity might be excited or frustrated.

The power of multi-modal fusion is that signals correlate and contradict in patterns that are diagnostically meaningful. High dwell time combined with high interaction velocity and a direct path to checkout is a very different behavioral signature from high dwell time combined with low interaction velocity and back-navigation. The fused representation captures these cross-signal patterns.

Combining all eight signal types produces estimates that are 4–7x more stable than single-modality systems, measured by coefficient of variation in purchase-intent scores across users with identical purchase outcomes. The reduction in variance is the key metric: more stable scores mean fewer false positive trigger firings, better campaign precision, and higher conversion rates per impression.
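The variance-reduction effect can be illustrated with a toy simulation (synthetic numbers, not product data): averaging eight independent noisy estimates of the same underlying intent shrinks the coefficient of variation by roughly √8 ≈ 2.8x.

```python
import random
import statistics

random.seed(0)
TRUE_INTENT = 0.7  # the latent intent each noisy modality is estimating

def noisy_signal():
    # One modality's estimate: true intent plus Gaussian noise.
    return TRUE_INTENT + random.gauss(0, 0.15)

single = [noisy_signal() for _ in range(1000)]
fused = [statistics.mean(noisy_signal() for _ in range(8)) for _ in range(1000)]

cv = lambda xs: statistics.stdev(xs) / statistics.mean(xs)
print(f"single-modality CV: {cv(single):.3f}")
print(f"8-signal fused CV:  {cv(fused):.3f}")  # roughly 2.8x lower
```

Real behavioral signals are correlated rather than independent, so the gain per added modality is smaller than this idealized case, which is consistent with the observed 4–7x range coming from cross-signal patterns as well as simple averaging.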

The graph-style fusion architecture treats signals as nodes in a behavioral graph, with edges representing temporal and causal relationships between events. This structure allows the model to reason about sequences ("viewed product, left, returned 20 minutes later, added to cart") rather than treating events as independent observations.


7. Performance Comparison Table

| Dimension | Server-Side CDP | On-Device ML (MicroTarget) |
|---|---|---|
| Event-to-decision latency | 300–582ms (real-time mode) / 15min–24h (batch mode) | 77–165ms end-to-end |
| Cold start sensitivity | High — inference server scale-to-zero adds 1–3s | None — model runs on device, always available |
| Batch dependency | Required for segment updates in most CDPs | None — inference is continuous, per-session |
| Data warehouse requirement | Yes — central store required for feature computation | No — features computed locally, only scores transmitted |
| Personal data transmitted | Full raw event stream (KB per event, PII-adjacent) | 3 float scores per trigger (~100 bytes, non-personal) |
| GDPR compliance surface | High — raw events at rest, in transit, and with processors | Minimal — no personal data transmitted for inference |
| Data subject deletion scope | Multi-system (pipeline, warehouse, feature store, ML training data, ad network) | Account-level records only; on-device data deleted locally |
| Signal stability (CV) | High variance — single or dual modality common | 4–7x lower variance via 8-signal fusion |
| Explainability (GDPR Art. 22) | Typically a black-box score | Concept Bottleneck: every decision has an auditable reason |
| Offline resilience | None — requires connectivity for inference | Full — WAL buffer replays events when connectivity resumes |
| Third-party processor exposure | Yes — raw events reach ad network processors | No — scores only; no behavioral history transmitted |
| Infrastructure required at customer | Full pipeline (ingestion, processing, warehouse, serving) | SDK only (server-side ranking is managed infrastructure) |

The Architectural Trade-Off

Server-side inference is simpler to build from scratch. All compute is centralized. The feature engineering, model training, and inference pipeline live in one place. You do not need to distribute model artifacts to client devices or manage model versioning across a heterogeneous device fleet.

The cost of that simplicity is paid in latency, privacy surface area, and compliance burden. For organizations targeting low-intent, high-frequency interactions where a 15-minute lag is acceptable and compliance is a second-order concern, server-side CDPs remain a practical choice.

For organizations targeting high-intent behavioral windows — cart abandonment, conversion moment, churn risk — where the targeting decision is only valuable within a 100–300ms window, on-device inference is not a marginal improvement. It is a structural requirement. The server-side pipeline cannot physically complete within the window, regardless of infrastructure investment.

The privacy advantage is equally structural. Moving inference to the device does not require policy changes, DPA amendments, or consent redesigns to reduce exposure — it changes the architecture so that the exposure does not exist.


See It in Action

MicroTarget's interactive simulation demonstrates the latency and conversion impact of real-time on-device targeting compared to batch CDP baselines using your own revenue parameters.

Run the cost-of-delay simulation to see how the difference between 150ms and 15-minute targeting latency translates into revenue for your specific traffic volume and conversion rate.


Ready to see MicroTarget in action?

Try the interactive Cost of Delay simulator or schedule a technical walkthrough with our team.