The CRM has one version of the customer. The billing system has another. The support tool has a third. Nobody notices until the CEO asks for active customer count and gets three answers.
That is what data silos feel like from the inside. Three honest people in a room, each with a dashboard, each confident their number is right, all of them partially correct. Definitions drift. Source systems diverge. Reconciliation work becomes someone's full-time job.
Gartner estimates poor data quality costs the average organization $12.9 million per year (2022). IDC's research on knowledge workers found the typical employee spends 19% of their workweek searching for and consolidating information (2023). McKinsey's operations practice puts the total tax on company capacity from split systems and manual reconciliation at 20% to 30% of output. That is enough drag to decide whether a growth-stage company hits its plan.
This piece covers how silos form, why they survive, and the patterns that unify the data without a two-year rebuild.
What is a data silo?
A data silo is a body of data owned and accessed by one part of a business, cut off from the systems used by other parts. The CRM is a silo. The financial ledger is a silo. The product analytics database is a silo. Each holds data other teams need, in a format only its owners understand, governed by rules only its owners set.
Silos are not inherently bad. Finance wants auditability and immutable history. Growth marketing wants speed and throwaway cohorts. Those requirements do not belong in the same database. The failure mode is when silos stop talking to each other and the company cannot answer basic cross-functional questions without a meeting and a spreadsheet.
Forrester's research on enterprise data strategy finds that the average large enterprise maintains 367 distinct software applications (2023). At mid-market scale, a $100M revenue company typically runs 60 to 100 operational SaaS tools. Without a plan to unify the data, every tool becomes a silo by default.
Why do data silos form?
Data silos form through four repeatable patterns: tool sprawl, departmental independence, M&A legacy, and custom-built systems that outlive their creators. Most growth companies have all four happening at once.
Tool sprawl is the most common origin. Sales picks Salesforce. Marketing picks HubSpot. Customer success picks Gainsight. Support picks Zendesk. Finance picks NetSuite. Each tool ships with its own data model, its own customer ID scheme, and its own fields. Each is correct inside its own boundary. Nothing reconciles outside of it. Gartner's 2024 research on SaaS proliferation found the average company adds 28% more applications year over year without retiring equivalent volume.
Departmental independence keeps sprawl alive. Each department owns its tool, its budget, and its definitions. Sales leadership does not want RevOps rewriting opportunity stages. The CFO does not want engineering touching the close calendar. The autonomy is efficient locally and expensive at the company level.
M&A legacy is the acute version. Two full stacks collide: two CRMs, two ERPs, two support tools, two warehouses. Integration is always funded for less time than it takes. Five years later, the acquired business still runs on its original systems.
Custom-built systems are the quietest cause. An engineer built an internal tool in 2018 to solve a workflow the SaaS market did not support. The tool works. The engineer left. No one understands the schema, and the data inside it is the authoritative source for a critical business process. Replacing it is a six-month project nobody will sponsor.
Why do silos survive?
Silos survive because the people closest to them have good reasons to protect them. Departmental ownership, switching cost, and organizational fear of change are the reasons they persist long after they have become expensive.
Departmental ownership is the first line of defense. The team that owns the tool knows how it works and what reports depend on current behavior. Unifying data requires surfacing those conventions and giving up local control. That feels like losing, even when the company-level outcome is a win.
Switching costs are real and underestimated. A warehouse migration is not a weekend project. An ELT pipeline replacement touches every downstream dashboard. A CRM consolidation disrupts sales workflows during the quarter it happens. Snowflake's mid-market case studies report 3 to 9 month timelines for initial warehouse stand-up, plus continuous work to onboard sources.
Fear of change is often rational. Every team has seen a data project that shipped late, broke reporting, or delivered a warehouse nobody trusts. The fix is not to dismiss the fear. It is to ship unification projects that produce visible wins in weeks, so the next phase gets sponsored.
The cost of staying put
Silos rarely get cheaper to unify over time. Every new tool in the stack, every new hire who learns the workarounds, and every new report built on a flawed source adds to the cost of eventual unification. Companies that delay a warehouse investment past $30M in revenue typically spend 2x to 3x more when they finally commit, according to analyses of Databricks and Snowflake customer data.
What does a unified data stack actually look like?
A unified data stack pulls data from every operational system into a central warehouse, transforms it into shared definitions, and pushes the clean results back into the tools where work happens. The core components are ELT pipelines, a cloud data warehouse, a transformation layer (typically dbt), a semantic layer, and reverse ETL for activation. Master data management governs the definitions that matter most, and data contracts keep upstream schema changes from silently breaking downstream work.
The silo-to-clarity playbook
The modern data stack collapses the old separation between data engineering and analytics. Warehouses are cheap enough to treat as the default store. Transformation happens in SQL, versioned in Git. A company of 200 people can stand up a credible version of this stack in eight to twelve weeks, if the scope stays tight.
Which cloud data warehouse is right?
The four dominant cloud data warehouses are Snowflake, Google BigQuery, Databricks, and Amazon Redshift. Each is built around a different pattern. The right choice depends on workload shape, existing cloud commitments, and how much data science runs alongside analytics.
| Warehouse | Best Fit | Pricing Model | Strengths | Trade-offs |
|---|---|---|---|---|
| Snowflake | Multi-source analytics, moderate data engineering appetite, cross-cloud flexibility | Separated storage and compute, per-second billing on virtual warehouses | Clean separation of compute, strong governance, multi-cloud, easy to share data across organizations | Compute costs climb fast under heavy concurrency; some advanced ML features trail Databricks |
| Google BigQuery | Teams already on GCP, ad-hoc analytics, large scan workloads | Serverless; pay per TB scanned or flat-rate slots | No infrastructure to manage, fast for large ad-hoc queries, tight integration with GA4 and Google Ads | Costs are unpredictable under exploratory use; less flexible for incremental transformation patterns |
| Databricks | Data science plus analytics, large-scale ML, lakehouse architecture | DBU-based compute, separate from storage | Lakehouse approach unifies data engineering and ML, strong for Spark workloads, Delta Lake format | Steeper learning curve for pure analytics teams; SQL-first users often find Snowflake simpler |
| Amazon Redshift | Existing AWS-heavy stacks, predictable workloads, cost-sensitive | Provisioned clusters or serverless | Deep AWS integration, competitive pricing for steady workloads, RA3 nodes separate storage and compute | Cluster management adds operational overhead; ecosystem tooling lags Snowflake and Databricks |
Gartner's 2024 Magic Quadrant for Cloud Database Management Systems named Snowflake, Databricks, Google, Microsoft, and AWS as leaders. For a cloud-agnostic growth company that wants the lowest learning curve, Snowflake is usually the default. If data science sits at the center of the product, Databricks is the better pick. If the company is already deep on GCP or AWS, BigQuery or Redshift cuts coordination cost.
The warehouse is less load-bearing than teams think. A well-run dbt project on Redshift beats a neglected dbt project on Snowflake. The transformation and semantic layers decide day-to-day clarity.
What is a semantic layer and why does it matter?
A semantic layer is a governed definition of business metrics that sits between the data warehouse and the tools people use to ask questions. Instead of every BI tool, dashboard, and notebook recomputing "monthly recurring revenue" from raw SQL, the semantic layer defines MRR once and exposes it everywhere. One answer, regardless of who asks.
The leading options are dbt Semantic Layer, Cube, and Transform (acquired by dbt Labs in 2023). dbt Semantic Layer hooks into the transformation work a team is already doing in dbt. Cube is a standalone headless BI layer with strong caching, popular with teams that embed analytics in their own products.
The reason the semantic layer matters is political, not technical. When sales, finance, and product each define "active customer" in their own BI tool, disputes go unresolved and the CEO ends up asking for a fourth number. When the definition lives in one place, versioned and owned, the argument happens once, at the layer where it belongs. Every downstream consumer inherits the answer.
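The mechanics can be sketched in a few lines. This is a hypothetical, minimal metric registry, not the actual dbt Semantic Layer or Cube API: the point is only that each metric is defined once, with a named owner, and every consumer resolves it through the same lookup instead of re-deriving the SQL.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str    # the governed definition, versioned alongside the dbt project
    owner: str  # the team accountable for the definition

# Illustrative definitions; table and column names are assumptions.
REGISTRY = {
    "mrr": Metric(
        name="mrr",
        sql="SELECT SUM(amount) FROM subscriptions WHERE status = 'active'",
        owner="finance",
    ),
    "active_customers": Metric(
        name="active_customers",
        sql="SELECT COUNT(DISTINCT customer_id) FROM subscriptions WHERE status = 'active'",
        owner="revops",
    ),
}

def metric_sql(name: str) -> str:
    """Every downstream consumer resolves a metric through this one function."""
    return REGISTRY[name].sql
```

When a BI tool, a notebook, and a reverse ETL sync all call `metric_sql("mrr")`, the argument about what MRR means has already happened once, at the registry.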
Reverse ETL: closing the loop back to operations
Reverse ETL moves clean, unified data from the warehouse back into the operational tools where work happens. Hightouch and Census are the dominant vendors. A sales rep opening Salesforce sees customer health, product usage, and support history next to the account record because the warehouse computed it and pushed it into Salesforce.
Without reverse ETL, the warehouse is a reporting asset. With it, the warehouse is the operating brain of the company. Marketing campaigns fire on product usage. Sales reps get leads scored by behavioral signals. Customer success sees churn risk before the renewal conversation. Census and dbt Labs case studies both show measurable lift in pipeline conversion and retention when warehouse-derived signals reach operational tools in hours, not days.
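The shape of a reverse ETL sync is simple enough to sketch. The snippet below is illustrative only: a list of dicts stands in for the warehouse result set, and a stub object stands in for what Hightouch or Census would do against the Salesforce API. The `__c`-suffixed field names mimic Salesforce custom field conventions but are assumptions.

```python
# Warehouse-computed signals, one row per account (illustrative data).
warehouse_rows = [
    {"account_id": "A-100", "health_score": 82, "weekly_active_users": 41},
    {"account_id": "A-101", "health_score": 37, "weekly_active_users": 3},
]

class FakeCrmClient:
    """Stub standing in for a real CRM API; records what would be written."""
    def __init__(self):
        self.records = {}

    def upsert(self, object_id, fields):
        self.records.setdefault(object_id, {}).update(fields)

def sync_health_scores(rows, crm):
    """Push warehouse-computed fields onto the matching CRM account records."""
    for row in rows:
        crm.upsert(row["account_id"], {
            "Health_Score__c": row["health_score"],
            "Weekly_Active_Users__c": row["weekly_active_users"],
        })
    return len(rows)

crm = FakeCrmClient()
synced = sync_health_scores(warehouse_rows, crm)
```

The real tools add scheduling, diffing, and rate-limit handling, but the core loop is this: read modeled rows, upsert into the operational system by a shared key.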
Master data management and data contracts
Master data management (MDM) is the discipline of maintaining a single authoritative record for core entities like customers, products, accounts, and employees. Data contracts are the agreements between data producers and consumers that specify schema, semantics, and service-level expectations for each data feed. Together they are the governance layer that keeps a unified stack from drifting back into silos.
MDM used to mean buying a heavy enterprise platform. For most $30M to $500M companies, a lighter approach works: pick the entities that matter (usually customer and product), identify the system of record for each, and build the unification logic in dbt. The "golden record" for a customer is a modeled view that reconciles CRM, billing, and product data into one row per customer, with the source of each field documented.
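A golden record of this kind is, at its core, one SQL join with field-level sources documented. The sketch below uses an in-memory SQLite database to stand in for the warehouse; in practice this would be a dbt model, and the table and column names here are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE crm     (customer_id TEXT, name TEXT);
CREATE TABLE billing (customer_id TEXT, plan TEXT);
CREATE TABLE product (customer_id TEXT, last_seen TEXT);

INSERT INTO crm     VALUES ('C-1', 'Acme Corp');
INSERT INTO billing VALUES ('C-1', 'enterprise');
INSERT INTO product VALUES ('C-1', '2024-05-01');
""")

# One row per customer, with the system of record noted per field.
golden = con.execute("""
SELECT
  crm.customer_id,
  crm.name,          -- system of record: CRM
  billing.plan,      -- system of record: billing
  product.last_seen  -- system of record: product analytics
FROM crm
LEFT JOIN billing ON billing.customer_id = crm.customer_id
LEFT JOIN product ON product.customer_id = crm.customer_id
""").fetchall()
```

The hard part is not the join; it is deciding, entity by entity, which system wins for each field and writing that decision down where the comments are.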
Data contracts, popularized by Chad Sanderson and practitioners at Convoy and Monzo, specify the columns, types, refresh cadence, and semantic meanings a table will expose. When an upstream engineer changes a schema, the contract fails, and the break is caught before it silently corrupts a downstream dashboard.
Governance fails when it is imposed top-down with no teeth. Governance works when the people who own the data sign the contract and CI enforces it. The model is the same as API versioning: breaking changes are expensive, documented, and negotiated.
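A contract check of the kind CI would enforce can be sketched in a few lines. This is a hedged illustration, not any vendor's implementation: the contract pins the columns and types a table promises, and any drift produces a violation before it reaches a downstream dashboard.

```python
# The contract: columns and types the producing team has signed up to.
CONTRACT = {
    "customer_id": "TEXT",
    "plan": "TEXT",
    "mrr_cents": "INTEGER",
}

def check_contract(contract: dict, actual_schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for col, expected_type in contract.items():
        if col not in actual_schema:
            violations.append(f"missing column: {col}")
        elif actual_schema[col] != expected_type:
            violations.append(
                f"type change on {col}: {actual_schema[col]} != {expected_type}"
            )
    return violations

# An upstream engineer renames mrr_cents to mrr; the contract catches it.
drifted = {"customer_id": "TEXT", "plan": "TEXT", "mrr": "INTEGER"}
```

Real implementations read the actual schema from the warehouse's information schema and run in the producer's CI, but the negotiation is the same: the check fails loudly, and the rename becomes a documented, versioned change instead of a silent break.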
When is the cost of unification worth it?
The cost of unification is worth it when cross-functional questions routinely require more than one system and the manual reconciliation work has become a visible tax. A few signals tend to appear together, and when they do, the return on a data platform investment is almost always positive within 12 months.
The first is repeated month-end fire drills. Finance, sales, and RevOps rebuild the same numbers in spreadsheets every close because the systems disagree. The second is blocked strategic questions. Leadership wants cohort retention, CAC by segment, or expansion revenue by industry, and the answer takes two weeks of manual work every time it gets asked. The third is customer-facing failures driven by data mismatches. A sales rep pitches a feature the customer already has. A support agent quotes a price that differs from billing. A marketing email lands in a churned account's inbox.
McKinsey's operations research and IDC's studies on data-driven organizations find that companies with mature data platforms grow revenue 30% faster than peers without them. The platform does not create the revenue. It removes the friction that was stealing it.
A mid-market company ($30M to $100M revenue) can stand up a credible modern data stack (warehouse, ELT, dbt, one semantic layer, and reverse ETL for two high-value activations) for $150K to $400K in first-year software and implementation costs. Payback is almost always under 18 months. The data strategy for operations piece covers how to scope that first investment.
Gradual unification vs big-bang rebuild
Unification can be pursued gradually or in one large rebuild. Gradual unification ships incremental wins on a live stack over 6 to 18 months. A big-bang rebuild replaces the stack in a single 12 to 24 month project. The gradual path wins for most growth companies.
Gradual unification starts with the highest-value data set, usually customer data, and builds the warehouse, transformation, and reverse ETL path for that entity alone. Once the first flow is live and trusted, the next entity follows. Product usage, then financials, then marketing touchpoints. Each phase delivers its own return, which funds the next.
Big-bang rebuild is the pattern that dominated enterprise warehouse projects in the 2010s, and it is the pattern that produced the horror stories. Two years of work. A vendor-led migration. A launch date that slips twice. The warehouse lands, and half the business still uses the old reports because they were built around the old definitions. Gartner analysts pegged the failure rate of those projects at over 50% when measured against business-defined success criteria.
There is a narrow case for big-bang. Post-M&A, when two incompatible stacks need to merge and the interim state is untenable, a rebuild is sometimes the only path. Highly regulated industries may need a clean start to establish audit provenance. Outside those cases, the gradual path is almost always the right call.
What is the minimum viable data platform?
A minimum viable data platform is one cloud warehouse, one ELT tool for the top three sources, one transformation framework (dbt), one BI or semantic layer, and a documented owner for the customer definition. Pointed at the highest-value data, it can be running in six to ten weeks and will deliver a defensible answer to "how many active customers do we have" within the first quarter.
Fivetran or Airbyte handles ingestion. Snowflake, BigQuery, Databricks, or Redshift is the warehouse. dbt handles transformation. A BI tool or dbt Semantic Layer serves the metrics. One data engineer, analytics engineer, or capable RevOps leader owns the first build.
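That first defensible answer is worth making concrete. The sketch below shows what the deliverable looks like once the definition is negotiated: one documented, owned function applied to the unified customer table. The 30-day window, field names, and sample data are illustrative assumptions, not a standard.

```python
from datetime import date, timedelta

# The negotiated, documented definition: owned by RevOps, versioned in Git.
ACTIVE_WINDOW_DAYS = 30

def is_active(customer: dict, today: date) -> bool:
    """Active = live subscription AND product usage inside the agreed window."""
    recently_seen = (today - customer["last_seen"]) <= timedelta(days=ACTIVE_WINDOW_DAYS)
    return customer["subscription_status"] == "active" and recently_seen

# Rows from the unified customer model (illustrative data).
customers = [
    {"id": "C-1", "subscription_status": "active",  "last_seen": date(2024, 5, 28)},
    {"id": "C-2", "subscription_status": "active",  "last_seen": date(2024, 1, 2)},
    {"id": "C-3", "subscription_status": "churned", "last_seen": date(2024, 5, 30)},
]

active_count = sum(is_active(c, date(2024, 6, 1)) for c in customers)
```

Every dashboard, board deck, and CRM sync then counts active customers the same way, because they all call the same definition.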
The anti-pattern is scoping the first platform to solve every problem in the business. It will not. The first platform proves the pattern works and lands one or two high-value unification wins. Later quarters add sources, metrics, and activations. The operations dashboard that actually gets used piece walks through what that first consumable surface should look like.
Key takeaways
Data silos are not a technology problem. They are what departments produce when each one does its job well without a shared plan for how the data connects. The fix is structural: a central warehouse, clean ELT ingestion, versioned transformations in dbt, a governed semantic layer, and reverse ETL back into operational tools. Master data management and data contracts keep the stack from drifting back into silos.
Companies that move from silos to operational clarity tend to do three things. They scope the first platform tightly and ship it in weeks. They treat governance as a negotiation between departments, not a policy document. They make the semantic layer a first-class concern, because that is where "one number" actually lives.
The CEO will still ask hard questions. The answer should no longer depend on which system the analyst happened to check. Reducing cross-departmental friction starts with the data platform that makes one answer possible.
Next step
Ready to go AI-native?
Schedule 30 minutes with our team. We’ll explore where AI can drive the most value in your business.
Get in Touch