How to Unify Customer Data from Multiple Sources Without a Data Warehouse
Most customer data platforms sell unification as a solved problem. The pitch is simple: funnel every data source into a central warehouse, run ETL on a schedule, and query the result. It sounds clean. In practice it introduces two problems that make real-time personalization structurally impossible: latency measured in hours, and a data surface area large enough to make compliance teams nervous.
This post explains why the warehouse-centric model breaks at the point of decision, and what a stream-first architecture looks like when you need unified behavioral profiles in under 500ms — without centralizing raw personal data.
1. The Data Silo Problem
A modern e-commerce or SaaS business accumulates behavioral data across at least four distinct systems, each with its own schema, cadence, and ownership boundary:
- Mobile SDK events — session starts, screen views, in-app purchases, push notification interactions. Emitted by the client, timestamped on-device, typically batched and flushed every 10–30 seconds.
- Web SDK events — page views, scroll depth, click streams, form interactions, checkout funnel steps. Near-continuous stream when the user is active.
- CRM records — account-level attributes, subscription tier, support ticket history, lifecycle stage. Updated asynchronously, often daily batch exports.
- Ad network signals — impression logs, click-throughs, attribution windows, retargeting lists. Delivered via webhooks or S3 drops, often delayed by the network's own batching schedule.
None of these systems know about each other. A user who clicks a retargeting ad on mobile, browses the web app, and then opens a support ticket has produced four separate event streams across three or four schemas with different user identifiers (device ID, cookie ID, CRM account ID, email hash). Joining them in real time requires identity resolution, schema normalization, and stateful aggregation — all before you can make a single targeting decision.
The naive solution is to push everything into a warehouse, run a nightly join, and export segments back to your activation layer. This works at a low cadence. It fails the moment "real-time" appears in your requirements.
2. Why Traditional CDPs Fail at Real-Time
Warehouse-centric CDPs process data in batch ETL cycles. A typical pipeline looks like this:
- Source connectors pull raw events from each system on a polling interval (5 minutes to 1 hour depending on the connector).
- Transformation jobs normalize schemas, resolve identities, and compute aggregate features.
- The result lands in a warehouse table (Snowflake, BigQuery, Redshift).
- Segment definitions run as SQL queries against those tables on a schedule.
- Activation payloads are exported back to downstream channels.
End-to-end latency from a user event to a segment membership update: 15 minutes to 24 hours, depending on where in the ETL cycle the event arrived.
That gap is fatal for high-intent moments. A user who adds five items to a cart, pauses, and starts browsing competitor pages has a purchase-intent signal that decays within minutes. A batch CDP will trigger the cart-abandonment campaign the following morning, after the window has closed.
Beyond latency, the warehouse model creates a compliance surface area: all raw behavioral events — including PII-adjacent signals like session paths, device fingerprints, and geolocation sequences — are copied and stored at rest in a system you now need to audit, encrypt, and potentially expose to data subject access requests. Every source system becomes a replication target.
3. Stream-First Architecture: Kafka Partitioned by User ID
The alternative starts by treating the event stream as the source of truth, not a staging area for the warehouse.
The core ingestion layer is a Kafka or Redpanda cluster with topics partitioned by user_id mod N. This partitioning decision is load-bearing: because all events for a given user land on the same partition, they are processed by the same consumer in strict arrival order. Sequential ordering per user is a prerequisite for stateful aggregation — you cannot compute "added to cart within 5 minutes of viewing the product page" if events for the same user arrive out of order on different partitions.
Partition count N is sized to your throughput target and the number of stream processor replicas. A cluster with 128 partitions can sustain roughly 1–5 million events per second depending on message size and replication factor, with horizontal scaling available by increasing N.
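A minimal producer sketch makes the mechanics concrete, assuming the confluent-kafka Python client; the topic name and event fields are illustrative. Kafka's default partitioner hashes the message key to pick a partition, so keying every message by user_id is enough to pin each user to a single partition:

```python
# Sketch: key each event by user_id so the default partitioner
# (hash of key, effectively mod partition count) routes all of a
# user's events to one partition. Topic and fields are illustrative.
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit_event(user_id: str, event_type: str, properties: dict) -> None:
    event = {
        "user_id": user_id,
        "event_type": event_type,
        "ts_ms": int(time.time() * 1000),
        "properties": properties,
    }
    # The key, not the payload, determines the partition: same user_id,
    # same partition, strict arrival order for that user.
    producer.produce(
        topic="behavioral-events",
        key=user_id.encode("utf-8"),
        value=json.dumps(event).encode("utf-8"),
    )

emit_event("user-42", "ProductViewed", {"sku": "SKU-1042"})
producer.flush()
```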
The stream processor — Flink or Spark Structured Streaming — consumes from Kafka with stateful operators backed by RocksDB or an in-memory KV store. Stateful processing means the operator maintains per-user state across events without requiring a database round-trip for each event. A Flink job computing "session duration" or "time since last purchase" keeps a rolling window in local RocksDB, emitting updated feature values downstream as state changes.
This replaces the ETL batch cycle entirely. There is no "load" phase into a warehouse. Computed features flow directly into the Feature Store as they are produced.
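To show the shape of stateful processing without a full Flink deployment, here is a plain-Python stand-in: a local dict plays the role of the RocksDB state backend, and the session-gap threshold and feature names are assumptions rather than anything prescribed by the architecture:

```python
# Stand-in for a stateful stream operator: per-user state lives in a
# local store (a dict here, RocksDB in a real Flink job), so no database
# round-trip is needed per event. Threshold and fields are illustrative.
import time
from collections import defaultdict

SESSION_GAP_MS = 30 * 60 * 1000  # assumed session timeout

state = defaultdict(lambda: {"session_start": None, "last_seen": None})

def process(event: dict) -> dict:
    """Consume one event, update per-user state, emit a feature value."""
    s = state[event["user_id"]]
    ts = event["ts_ms"]
    # Open a new session on the first event or after the gap expires.
    if s["last_seen"] is None or ts - s["last_seen"] > SESSION_GAP_MS:
        s["session_start"] = ts
    s["last_seen"] = ts
    # A real job would emit this downstream to the Feature Store.
    return {
        "user_id": event["user_id"],
        "session_duration_ms": ts - s["session_start"],
    }

print(process({"user_id": "user-42", "ts_ms": int(time.time() * 1000)}))
```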
4. The Distributed Feature Store: AP-Weighted CAP
The Feature Store is a distributed key-value system where each key is a user identifier and the value is a structured behavioral profile: a set of named features with numeric values and timestamps.
When designing a distributed KV store, the CAP theorem forces a choice under network partition: prioritize Consistency (C) or Availability (A). A warehouse-centric system implicitly chooses consistency — the segment query runs against a single source of truth. The stream-first Feature Store makes the opposite choice: AP-weighted (Availability + Partition Tolerance), trading strict consistency for guaranteed write acceptance and sub-millisecond read latency.
In practice this means:
- Always accept writes. A stream processor emitting a feature update will never be blocked by a partition or replica lag. The write is accepted and propagated asynchronously.
- Stale reads are acceptable. The decision engine tolerates feature values up to 500ms stale. For behavioral targeting, a purchase-intent score that is 400ms old is functionally equivalent to a perfectly fresh one.
- Eventual consistency, not real-time consistency. Replicas converge after the partition heals. The window of inconsistency is typically 10–100ms on a healthy network.
Conflict resolution is handled at two levels:
- Last-Write-Wins (LWW): The default strategy for most feature keys. When two writes arrive with different values for the same feature, the write with the higher logical timestamp wins. This is correct for continuous metrics like session_depth or scroll_velocity, where the latest observation supersedes all prior ones.
- Vector clocks: Used for critical segment membership flags — for example, high_value_account: true set by the CRM and potentially contradicted by a stale SDK event. Vector clocks track causality across write origins, allowing conflict detection at read time rather than blind overwrite. (A merge sketch covering both strategies follows this list.)
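A minimal sketch of the two strategies, with the timestamp and clock representations as assumptions rather than any specific library:

```python
# Sketch of both conflict-resolution strategies described above.

def lww_merge(a: dict, b: dict) -> dict:
    """Last-Write-Wins: keep the write with the higher logical timestamp."""
    return a if a["ts"] >= b["ts"] else b

def vclock_compare(a: dict, b: dict) -> str:
    """Compare vector clocks (origin -> counter). Returns 'a', 'b', or
    'conflict' when neither write causally dominates the other."""
    origins = set(a) | set(b)
    a_ge = all(a.get(o, 0) >= b.get(o, 0) for o in origins)
    b_ge = all(b.get(o, 0) >= a.get(o, 0) for o in origins)
    if a_ge and b_ge:
        return "a"  # identical clocks: same causal history
    if a_ge:
        return "a"
    if b_ge:
        return "b"
    return "conflict"  # concurrent writes: surfaced at read time

# LWW: the later scroll_velocity observation simply wins.
print(lww_merge({"value": 1.2, "ts": 100}, {"value": 0.8, "ts": 90}))

# Vector clocks: CRM and SDK wrote concurrently, so the conflict is
# detected rather than blindly overwritten.
print(vclock_compare({"crm": 2, "sdk": 1}, {"crm": 1, "sdk": 2}))
```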
The underlying storage is polyglot: Redis for hot-path KV access (sub-millisecond P99 reads), TimescaleDB for time-series features requiring temporal queries (e.g., "events in last 7 days"), PostgreSQL for identity mappings and OLTP-style account records, and S3 for cold feature history that informs offline model training.
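For the hot path, a sketch of the Redis side, assuming redis-py with one hash per user; the key naming and feature fields are illustrative:

```python
# Behavioral profile as a Redis hash keyed by user ID: written by the
# stream processor, read by the decision engine. Requires redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_features(user_id: str, features: dict) -> None:
    # One HSET per feature update on the per-user hash.
    r.hset(f"features:{user_id}", mapping={k: str(v) for k, v in features.items()})

def read_profile(user_id: str) -> dict:
    # A single HGETALL on the hot path, sub-millisecond on a healthy node.
    return r.hgetall(f"features:{user_id}")

write_features("user-42", {"session_depth": 7, "purchase_intent": 0.83})
print(read_profile("user-42"))
```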
5. Event-Driven Behavioral Profiles: From Domain Events to Decision State
The Feature Store is populated by three categories of events, each with a different propagation path:
Domain events are facts emitted by your core application — OrderPlaced, ProductViewed, SessionStarted, CartAbandoned. These are the raw behavioral signals. They carry no semantic interpretation; they describe what happened, not what it means.
Integration events are derived from domain events and cross service boundaries. The stream processor transforms ProductViewed + AddToCart + no OrderPlaced within 30min into an integration event like HighIntentCartAbandonment, which is then written to the Feature Store as a structured behavioral signal.
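A sketch of that derivation, using the event names above; the pending-cart store and periodic sweep are illustrative stand-ins for what a real Flink job would do with registered timers:

```python
# Promote domain events to an integration event: track the last AddToCart
# per user and emit HighIntentCartAbandonment if no OrderPlaced arrives
# within the 30-minute window. The sweep mechanism is illustrative.
WINDOW_MS = 30 * 60 * 1000
pending_carts: dict[str, int] = {}  # user_id -> ts_ms of last AddToCart

def on_domain_event(event: dict) -> None:
    if event["event_type"] == "AddToCart":
        pending_carts[event["user_id"]] = event["ts_ms"]
    elif event["event_type"] == "OrderPlaced":
        pending_carts.pop(event["user_id"], None)  # intent resolved

def sweep(now_ms: int) -> list:
    """Emit integration events for carts past the abandonment window."""
    emitted = []
    for user_id, ts in list(pending_carts.items()):
        if now_ms - ts > WINDOW_MS:
            emitted.append({"event_type": "HighIntentCartAbandonment",
                            "user_id": user_id, "ts_ms": now_ms})
            del pending_carts[user_id]
    return emitted
```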
System events — emitted by infrastructure components via RabbitMQ — handle operational concerns like schema migrations, feature store flushes, and consumer health checks. These travel a separate bus and never pollute the behavioral event stream.
When a consumer fails to process an event (malformed payload, transient downstream error, schema mismatch), the event is routed to a Dead Letter Queue (DLQ). The DLQ prevents pipeline stalls: a bad event does not block the partition. Failed events are inspected, potentially transformed, and replayed into the main topic once the issue is resolved. Without a DLQ, a single malformed event on a user partition can halt all subsequent processing for that user.
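A consumer-side sketch of that routing, again assuming the confluent-kafka client; topic and group names are illustrative:

```python
# Route unprocessable events to a dead-letter topic so the partition
# keeps moving. Reuses the process() sketch from section 3.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "feature-builder",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["behavioral-events"])
dlq = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        process(event)  # stateful aggregation from the earlier sketch
    except Exception as exc:
        # Park the bad event with error context; do not block the partition.
        dlq.produce(
            "behavioral-events.dlq",
            key=msg.key(),
            value=msg.value(),
            headers=[("error", str(exc).encode("utf-8"))],
        )
        dlq.flush()
```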
TimescaleDB serves time-series features via hypertables — partitioned by time, with automatic chunk management. Continuous aggregates compute rolling windows (7-day purchase frequency, 30-day engagement score) incrementally as new data arrives, rather than recomputing from scratch on each query. This makes temporal feature reads O(1) against pre-materialized aggregates rather than O(n) full scans.
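A sketch of that setup via psycopg2, with table and view names as assumptions; create_hypertable and the timescaledb.continuous option are standard TimescaleDB, but the schema itself is illustrative:

```python
# Hypertable for raw events plus a continuous aggregate for daily buckets.
# Requires the timescaledb extension; names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=features user=postgres")
conn.autocommit = True  # continuous aggregates cannot be created in a transaction
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS behavioral_events (
        user_id    TEXT        NOT NULL,
        event_type TEXT        NOT NULL,
        ts         TIMESTAMPTZ NOT NULL
    );
""")
# Time-partitioned chunks, managed automatically by TimescaleDB.
cur.execute("SELECT create_hypertable('behavioral_events', 'ts', if_not_exists => TRUE);")
# Incrementally maintained daily buckets: a 7-day engagement read becomes
# a scan over seven pre-materialized rows instead of a full table scan.
cur.execute("""
    CREATE MATERIALIZED VIEW engagement_daily
    WITH (timescaledb.continuous) AS
    SELECT user_id, time_bucket('1 day', ts) AS bucket, count(*) AS events
    FROM behavioral_events
    GROUP BY user_id, bucket;
""")
```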
6. Practical Integration: What You Connect
Connecting to the stream-first architecture involves four integration patterns, each suited to a different data source:
SDK events (mobile and web) flush via the MicroTarget SDK directly to the ingestion endpoint, which writes to Kafka. The SDK maintains a local write-ahead log buffer during offline periods and flushes on reconnect. Events carry a device-generated UUID and a millisecond-precision timestamp, enabling deduplication at the stream processor.
Webhook ingestion handles ad network callbacks, payment processor events, and third-party SaaS hooks. A stateless ingestion service normalizes the webhook payload into the canonical event schema and writes to Kafka. Idempotency keys prevent duplicate processing on webhook retries.
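A sketch of such a service in Flask, with the header name, canonical fields, and 24-hour dedup TTL all assumptions:

```python
# Stateless webhook ingestion: normalize the payload into the canonical
# schema and drop retried deliveries via an idempotency key in Redis.
import time
from flask import Flask, request
import redis

app = Flask(__name__)
dedup = redis.Redis(host="localhost", port=6379)

@app.route("/hooks/ad-network", methods=["POST"])
def ingest():
    key = request.headers.get("X-Idempotency-Key")
    # SET NX fails if the key was already seen: a retried delivery.
    if key and not dedup.set(f"hook:{key}", 1, nx=True, ex=86400):
        return {"status": "duplicate"}, 200
    payload = request.get_json(force=True)
    event = {  # canonical event schema (illustrative)
        "user_id": payload.get("external_user_id"),
        "event_type": payload.get("event", "AdInteraction"),
        "ts_ms": payload.get("timestamp", int(time.time() * 1000)),
        "source": "ad-network-webhook",
    }
    emit_event(event["user_id"], event["event_type"], event)  # producer from section 3
    return {"status": "accepted"}, 202
```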
CRM sync patterns are the most nuanced. CRM data is inherently batch — Salesforce, HubSpot, and similar systems do not emit a real-time event stream. The recommended pattern is a change-data-capture (CDC) feed from the CRM's PostgreSQL or MySQL backing store, streaming row-level changes into Kafka via Debezium. This converts the CRM's batch model into an approximate event stream with 5–30 second lag, acceptable for account-level features like subscription tier and lifecycle stage.
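Registering that feed is a single call to the Kafka Connect REST API. A sketch, with hostnames, credentials, and table list as assumptions; the config keys follow Debezium's Postgres connector (topic.prefix is the Debezium 2.x name, older versions use database.server.name):

```python
# Register a Debezium Postgres connector against the CRM's replica.
import requests

connector = {
    "name": "crm-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "crm-replica.internal",  # assumed host
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "********",
        "database.dbname": "crm",
        "topic.prefix": "crm",  # row changes land on crm.<schema>.<table>
        "table.include.list": "public.accounts,public.subscriptions",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```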
Backend service events are emitted directly by your application services via the integration event bus (RabbitMQ), then bridged into Kafka for stream processing. Services own their domain event schema; the stream processor handles cross-domain correlation.
7. What You Get: Real-Time Segments in 50–150ms
The end result of this architecture is segment membership that updates within 50 to 150 milliseconds of the triggering event, against the 15 minutes to 24 hours typical of batch-ETL CDPs.
That gap is not a marginal improvement. It is the difference between catching a user while they are still on the page and sending them an email the next morning. For high-value behavioral moments — peak intent, frustration signals, conversion windows — the timing of the trigger determines whether the message is relevant or noise.
Practically, the 50–150ms budget breaks down as follows:
- Event emission from SDK to Kafka ingestion: ~5–20ms (network)
- Stream processor state update and feature write: ~20–50ms (stateful operator + RocksDB write)
- Feature Store propagation to read replicas: ~10–30ms (async replication)
- Segment evaluation and dashboard update: ~15–50ms (rule evaluation against fresh feature values)
Total: 50–150ms end-to-end, visible in the segment dashboard in real time.
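The last step of that budget, rule evaluation, is deliberately simple. A sketch, with the rule shape and thresholds as assumptions, reading the profile written in section 4:

```python
# Evaluate a conjunctive segment rule against fresh feature values.
def in_segment(profile: dict, rule: list) -> bool:
    """Every (feature, op, threshold) clause must hold on current values."""
    ops = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}
    return all(
        ops[op](float(profile.get(feature, 0)), threshold)
        for feature, op, threshold in rule
    )

high_intent = [("purchase_intent", ">=", 0.8), ("session_depth", ">=", 5)]
print(in_segment(read_profile("user-42"), high_intent))  # reader from section 4
```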
For downstream activation — push notification dispatch, email trigger, webhook to ad network — add the latency of the downstream channel. Push delivery adds 50–200ms. Email dispatch adds 100–500ms. The behavioral targeting decision itself is never the bottleneck.
The Architecture Decision
The fundamental trade-off is this: warehouse-centric CDPs optimize for query flexibility and analytical completeness at the cost of latency. Stream-first architectures optimize for decision latency and write availability at the cost of some query complexity for historical analytics.
For real-time behavioral targeting, the trade-off is clear. The decision engine does not need to run ad-hoc historical queries — it needs the current behavioral state of a specific user, now. The AP-weighted Feature Store with eventual consistency and 500ms stale tolerance is structurally the right fit for that access pattern.
If you need historical analytics alongside real-time decisions, run both: stream to the Feature Store for decisions, and also sink events to S3 for cold analytical queries via Athena or Spark. The two access patterns have different latency and consistency requirements. They should use different storage systems — not be forced to share a warehouse that serves neither well.
The data warehouse is not wrong. It is just being asked to do something it was not designed for.