Building an ETL Pipeline for Sales Data
Your CRM holds deal data, your marketing platform holds campaign data, and your billing system holds revenue data. Getting these into one place where you can analyze them requires an ETL (Extract, Transform, Load) pipeline. This is no longer exclusively a data engineering problem: modern tools make it accessible to RevOps teams without writing code. But the decisions you make in setting up the pipeline determine whether your analytics are trustworthy or full of gaps.
ETL vs ELT: Which Approach Fits Sales Data
Traditional ETL extracts data from source systems, transforms it in transit, and loads the clean result into a destination. This worked when compute was expensive and storage was limited.
Modern practice has shifted to ELT: extract raw data, load it directly into a warehouse, then transform it inside the warehouse. Cloud warehouses like Snowflake and BigQuery make compute cheap, so it's faster and more reliable to store the raw data first and clean it later.
For sales data specifically, ELT is almost always the right choice. Your CRM exports raw objects (accounts, contacts, opportunities, activities) into the warehouse, and your transformation layer (dbt is the standard) models them into the tables your analysts need. This preserves the raw data as an audit trail, which matters when someone asks why last quarter's numbers look different from what the CRM showed.
The practical implication: when you hear 'ETL' in 2026, the tool in question almost certainly follows the ELT pattern; Fivetran, Airbyte, and Stitch all work this way. The distinction matters when you're designing your pipeline architecture, but the tools handle the mechanics.
Extraction: Getting Data Out of Your Revenue Tools
The extraction step pulls data from source systems into your warehouse. For RevOps, the critical sources are your CRM (Salesforce, HubSpot), marketing automation (Marketo, HubSpot Marketing), billing (Stripe, Chargebee), and potentially product analytics (Mixpanel, Amplitude).
Managed ELT tools handle extraction with pre-built connectors. Fivetran offers 500+ connectors, and its Salesforce connector is among the most mature on the market. It handles incremental syncs (only pulling changed records), schema changes, and API rate limiting automatically. Pricing is consumption-based, billed on monthly active rows, which means costs scale with your data volume.
Airbyte is the open-source alternative. You can self-host it for free or use its usage-based cloud product. Connector quality varies: the Salesforce and HubSpot connectors are production-grade, while less common sources may rely on community-built connectors that need more testing.
For most RevOps teams, a managed tool (Fivetran or Airbyte Cloud) is the right choice. Building custom extraction scripts with Python and the CRM API is technically possible, but maintaining those scripts when APIs change, rate limits shift, or schema updates happen creates ongoing engineering work that a managed tool handles automatically.
Sync frequency matters for sales data. Daily syncs are sufficient for pipeline reporting and weekly reviews. Hourly syncs are needed for real-time dashboards or triggered workflows. Real-time syncs (available on Fivetran's higher tiers) are rarely necessary for RevOps unless you're powering operational workflows from warehouse data.
Transformation: Turning Raw CRM Data Into Useful Models
Raw CRM data is messy. Salesforce exports include system fields, picklist values that changed over time, deleted records, and relationship tables that need joining. Transformation turns this raw export into clean, analysis-ready tables.
dbt (data build tool) is the standard for transformation in modern data stacks. It lets you write SQL models that define how raw tables become clean ones. For RevOps, common transformations include:
Opportunity pipeline snapshots. Your CRM stores the current state of each deal. To analyze stage progression over time, you need to snapshot the pipeline daily and store historical states. A dbt model captures this, enabling stage duration analysis and conversion rate trends.
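dbt's built-in snapshots handle this pattern directly. A minimal sketch, assuming a Fivetran-loaded `salesforce` schema (the snake_case field names follow Fivetran's conventions and may differ in your warehouse):

```sql
{% snapshot opportunity_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='system_modstamp'
    )
}}

-- dbt compares this query's results between runs and records
-- row-level changes via dbt_valid_from / dbt_valid_to columns
select
    id,
    stage_name,
    amount,
    close_date,
    system_modstamp
from {{ source('salesforce', 'opportunity') }}

{% endsnapshot %}
```

Run `dbt snapshot` on a daily schedule and the history accumulates automatically.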
Lead-to-revenue mapping. Connecting a marketing lead to the opportunity it influenced to the revenue it generated requires joining records across multiple objects with complex attribution logic. This join is too complex for CRM-native reporting but straightforward in dbt.
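A simplified sketch of the lead-to-opportunity leg of that join, using Salesforce's standard lead-conversion fields (the `stg_salesforce__*` model names follow the Fivetran package's naming convention and are illustrative):

```sql
-- converted leads joined to the opportunities they produced;
-- a full model would also join campaign touches and billing records
select
    ld.id as lead_id,
    ld.lead_source,
    opp.id as opportunity_id,
    opp.amount,
    opp.close_date
from {{ ref('stg_salesforce__lead') }} as ld
join {{ ref('stg_salesforce__opportunity') }} as opp
    on opp.id = ld.converted_opportunity_id
where ld.is_converted
```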
Metric definitions. What counts as a 'qualified opportunity'? What's the definition of 'sales cycle length'? Encoding these definitions in dbt models ensures every dashboard uses the same logic. When definitions change, you update one model and every downstream report reflects the change.
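A sketch of what such a definitions model might look like (Snowflake-style `datediff`; the stage names stand in for your own qualification criteria):

```sql
-- one place to encode 'qualified' and 'sales cycle length';
-- every downstream report selects from this model
select
    id as opportunity_id,
    stage_name,
    created_date,
    close_date,
    datediff('day', created_date, close_date) as sales_cycle_days,
    stage_name not in ('Prospecting', 'Qualification') as is_qualified
from {{ ref('stg_salesforce__opportunity') }}
```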
The RevOps-specific dbt packages (like the Fivetran Salesforce dbt package) provide pre-built models for common use cases. They handle Salesforce object relationships, field history tracking, and standard metrics. Starting with these packages and customizing them saves weeks of development compared to building from scratch.
Loading and Storage: Choosing the Right Warehouse
The three dominant cloud warehouses are Snowflake, Google BigQuery, and Amazon Redshift. For RevOps teams, the choice often comes down to existing cloud provider relationships and pricing model preferences.
Snowflake separates compute from storage, so you pay for each independently. This works well for RevOps use cases where data volumes are moderate but queries can be complex. Snowflake's auto-scaling handles sporadic usage patterns (heavy queries during business hours, idle at night) without manual tuning. Starting costs for a small RevOps use case run $50-200 per month.
BigQuery uses a pay-per-query model in its on-demand tier. You pay for the bytes your queries scan, not for running compute instances. For teams that run a handful of dashboard refreshes and ad-hoc queries per day, this can be significantly cheaper than Snowflake. BigQuery is the natural choice if you're already in the Google Cloud ecosystem. On-demand pricing makes it nearly free for small workloads.
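To make that concrete: BigQuery's on-demand tier bills roughly $6.25 per TiB scanned at the time of writing, with the first TiB each month free. A team whose dashboard refreshes and ad-hoc queries scan 100 GB per day (about 3 TiB per month) would pay under $15 per month in query costs.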
Redshift is the default for AWS-heavy organizations. Its Serverless option simplified what used to be a complex provisioning process. Performance is competitive, but the ecosystem of tools and dbt packages tends to be better tested on Snowflake and BigQuery.
For a RevOps team making this choice for the first time: BigQuery if you want the lowest cost at small scale, Snowflake if you want the broadest tool ecosystem, Redshift if your company already runs on AWS and your data team prefers it.
Common Pipeline Failures and How to Prevent Them
The most frequent pipeline failure is schema changes in the source system. When someone adds a custom field to Salesforce or changes a picklist value, it can break your extraction and transformation layers. Managed ELT tools like Fivetran handle schema changes automatically for extraction. But your dbt models may reference specific field names that changed. Build your dbt models to be resilient to new fields (don't use SELECT * in production models) and set up alerts for schema drift.
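For example, a staging model with an explicit column list keeps a new custom field from silently reshaping downstream tables (a minimal sketch; column names assume Fivetran's snake_case conventions):

```sql
-- explicit column list: a new Salesforce custom field won't flow
-- downstream until someone deliberately adds it here
select
    id,
    name,
    stage_name,
    amount,
    close_date,
    is_closed,
    is_won
from {{ source('salesforce', 'opportunity') }}
```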
API rate limiting causes silent data gaps. Salesforce's API has daily limits that depend on your edition and number of licenses. A full initial sync of a large Salesforce instance can consume days of API capacity. Fivetran and Airbyte manage rate limiting, but during the initial sync, plan for 2-3 days of extraction before the pipeline is fully loaded.
Data freshness mismatches create confusion. If your CRM data syncs hourly but your billing data syncs daily, a dashboard that joins both will show inconsistencies during the gap. Document the sync schedule for each source and include 'data last updated' timestamps on every dashboard.
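One lightweight way to surface this, assuming a Fivetran-loaded warehouse (Fivetran stamps every row with a `_fivetran_synced` timestamp; the source names are illustrative):

```sql
-- per-source freshness feed for a 'data last updated' dashboard tile
select 'salesforce' as source, max(_fivetran_synced) as last_synced_at
from {{ source('salesforce', 'opportunity') }}
union all
select 'stripe' as source, max(_fivetran_synced) as last_synced_at
from {{ source('stripe', 'invoice') }}
```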
Duplicate records in the CRM produce duplicate records in the warehouse. Your pipeline inherits your CRM's data quality problems. Deduplication should happen at the CRM level (using tools like DemandTools or LeanData) before the data hits your warehouse. Trying to deduplicate in the transformation layer is possible but fragile.
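If you do have to dedupe downstream as a stopgap, a window function is the usual pattern. A sketch that keeps the most recently modified contact per email address (fragile because it hard-codes a survivorship rule the CRM never sees):

```sql
-- last-resort warehouse dedupe: keep the newest record per email
select *
from (
    select
        contacts.*,
        row_number() over (
            partition by lower(email)
            order by system_modstamp desc
        ) as row_num
    from {{ ref('stg_salesforce__contact') }} as contacts
) as deduped
where row_num = 1
```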
The simplest safeguard is a data quality monitor that runs after each pipeline refresh. Tools like Elementary, Monte Carlo, or even custom dbt tests check for row count anomalies, null rates, and freshness. When something breaks, you find out in minutes instead of when an executive questions a dashboard number.
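Even without a dedicated tool, a dbt singular test catches the worst failures. A sketch using Fivetran's `_fivetran_synced` column and Snowflake-style `dateadd` (any row the query returns counts as a failure):

```sql
-- tests/assert_opportunities_fresh.sql
-- fails the dbt build when no opportunity row has synced in 24 hours
select max(_fivetran_synced) as last_synced_at
from {{ source('salesforce', 'opportunity') }}
having max(_fivetran_synced) < dateadd('hour', -24, current_timestamp())
```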
A Practical Implementation Timeline
Week 1: Choose your warehouse and ELT tool. Sign up for Fivetran or Airbyte Cloud and connect your CRM as the first source. Run the initial sync. Connect a BI tool (Looker, Tableau, or even Metabase for a lighter option) to the warehouse.
Week 2: Add your second and third data sources (marketing automation and billing). While these sync, install the relevant dbt packages for your CRM and start exploring the pre-built models. Customize the opportunity staging model to match your pipeline stages.
Week 3: Build your first three cross-system reports. Start with pipeline-to-revenue analysis (CRM + billing), marketing attribution (CRM + marketing), and sales activity metrics (CRM activities + engagement data). These three reports demonstrate the value of the warehouse to stakeholders.
Week 4: Add data quality tests and monitoring. Set up freshness checks, row count alerts, and null rate thresholds for critical fields. Document your metric definitions in a shared doc or wiki. Present the initial dashboards to leadership.
Ongoing: Add new data sources as needs arise. Refine transformation models as business logic changes. Review pipeline health weekly. The maintenance burden for a well-built pipeline is 2-4 hours per week for a RevOps analyst. If it's taking more than that, the initial architecture needs improvement.
Total cost for a mid-market implementation: $1,500-4,000 per month for tools (ELT + warehouse + BI), plus 20-40 hours of initial setup time from a RevOps analyst or data-savvy ops person.
Frequently Asked Questions
What is the difference between ETL and ELT for sales data?
ETL transforms data before loading it into the destination. ELT loads raw data first, then transforms it inside the warehouse. ELT is the modern standard because cloud warehouses make in-warehouse transformation fast and cheap, and keeping raw data provides an audit trail.
Do I need a data engineer to build a sales data pipeline?
Not necessarily. Tools like Fivetran handle extraction without code, and dbt packages provide pre-built transformation models. A RevOps analyst with SQL skills can build and maintain a basic pipeline. Complex custom models or high-volume pipelines may need engineering support.
How often should sales data sync from CRM to warehouse?
Daily syncs work for most reporting and analysis. Hourly syncs are needed for operational dashboards that inform same-day decisions. Real-time syncs are rarely necessary for RevOps and add significant cost.
What does a basic sales data pipeline cost?
A mid-market setup (Fivetran + Snowflake or BigQuery + a BI tool) runs $1,500-4,000 per month. BigQuery's pay-per-query model can reduce warehouse costs significantly for smaller workloads. The biggest hidden cost is the initial setup time: 20-40 hours from a qualified person.