Progressive Data Enrichment Pipeline with n8n
Build cost-efficient data enrichment workflows in n8n using waterfall logic to balance data quality with API costs across multiple providers.
Last quarter, I watched a client burn through $4,000 in Clearbit credits in three weeks because they were enriching every single form fill—including job seekers, students, and spam submissions. Their enrichment “strategy” was essentially firing expensive API calls at anything with an email address.
We rebuilt their enrichment pipeline using progressive waterfall logic: start with free sources, escalate to paid APIs only when necessary, and skip enrichment entirely for low-value prospects. Their enrichment costs dropped 73% while data quality actually improved.
This is the power of intelligent enrichment architecture—and n8n is the perfect platform to build it.
Why Most Enrichment Strategies Waste Money
The “Enrich Everything” Fallacy: Many teams treat enrichment as an all-or-nothing decision. But enriching a personal Gmail address the same way you enrich an enterprise Fortune 500 prospect is like using a firehose to water a houseplant.
The Single-Source Dependency: Relying on one enrichment provider (Clearbit, ZoomInfo, Apollo) means you’re locked into their pricing, coverage gaps, and data freshness issues. When that API goes down or hits rate limits, your entire enrichment pipeline stops.
The Timing Problem: Enriching leads immediately at capture wastes credits on prospects who’ll never engage. But waiting too long means your sales team is working with incomplete data during critical early outreach.
The ENRICH Framework for n8n
After building enrichment pipelines for dozens of clients, I’ve developed the ENRICH framework:
- Evaluate lead quality before enrichment
- Native data sources first (free/cheap)
- Rate limiting and cost controls
- Intelligent provider waterfall
- Caching and deduplication
- Handoff and CRM updates
Architecture: The Waterfall Enrichment Model
Progressive enrichment works like a waterfall—each tier catches what the previous tier missed:
Tier 0: Pre-Enrichment Validation (Free)
- Email validation
- Domain categorization (corporate vs. personal)
- Spam detection
- Lead scoring
Tier 1: Free Public Sources (Free)
- Company website scraping
- LinkedIn public profiles
- WHOIS data
- DNS/SPF records
- Social media APIs (limited data)
Tier 2: Budget Enrichment APIs ($0.01-0.05 per lookup)
- Hunter.io for email finding
- FullContact for social profiles
- IPinfo for company location
- Custom scrapers
Tier 3: Premium Enrichment ($0.50-2.00 per lookup)
- Clearbit for firmographics
- ZoomInfo for org charts
- Apollo for technographics
- PredictLeads for funding data
Tier 4: Manual Research (High-value only)
- SDR manual research
- Intent data platforms
- Custom investigative research
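The idea that each tier catches what the previous tier missed can be sketched as a merge that only fills fields that are still empty. This is a minimal illustration in plain JavaScript rather than n8n-specific code; the helper names (isMissing, mergeTierResult) and the sample values are assumptions introduced here, not part of any provider's API.

```javascript
// Hypothetical helper: a value counts as missing if it is null, undefined,
// an empty string, or an empty array.
function isMissing(value) {
  return value === null || value === undefined || value === '' ||
    (Array.isArray(value) && value.length === 0);
}

// Merge a tier's result into the record without overwriting values that an
// earlier (cheaper) tier already supplied.
function mergeTierResult(record, tierResult) {
  const merged = { ...record };
  for (const [field, value] of Object.entries(tierResult)) {
    if (isMissing(merged[field]) && !isMissing(value)) {
      merged[field] = value;
    }
  }
  return merged;
}

// Example: Tier 1 found the company name, Tier 2 fills in the industry.
const afterTier1 = { companyName: 'Acme', industry: null, technologies: [] };
const afterTier2 = mergeTierResult(afterTier1, {
  companyName: 'ACME Inc.',
  industry: 'Software',
  technologies: ['HubSpot']
});
console.log(afterTier2);
```

Keeping the earlier value when both tiers return one is a design choice made for this sketch; if the paid provider is more trustworthy for a given field, the precedence can be reversed per field.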
Building the Foundation in n8n
Module 1: Intake and Initial Validation
Every enrichment workflow starts by determining if enrichment is even warranted:
Webhook Node Configuration:
{
"method": "POST",
"path": "lead-enrichment",
"response_mode": "immediately",
"expected_fields": ["email", "company", "source"]
}
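Email Validation Node (Code):
// n8n Code Node (mode: Run Once for Each Item)
const email = String($input.item.json.email || '').trim().toLowerCase();
// Validate the format first so the domain split below is safe
const isValidFormat = /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
const domain = isValidFormat ? email.split('@')[1] : null;
// Personal email domains - skip enrichment
const personalDomains = [
  'gmail.com', 'yahoo.com', 'hotmail.com',
  'outlook.com', 'aol.com', 'icloud.com'
];
// Validation checks
const isPersonalEmail = personalDomains.includes(domain);
// checkDisposableEmail is a placeholder for your own lookup (blocklist or
// validation API); it is not an n8n built-in
const isDisposable = isValidFormat ? await checkDisposableEmail(domain) : false;
const shouldSkip = !isValidFormat || isPersonalEmail || isDisposable;
return {
  ...$input.item.json, // keep source, company, etc. for the scoring node
  email: email,
  domain: domain,
  isPersonalEmail: isPersonalEmail,
  isValidFormat: isValidFormat,
  isDisposable: isDisposable,
  enrichmentPriority: shouldSkip ? 'skip' : 'proceed'
};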
Lead Scoring Node: Calculate enrichment priority before spending credits:
// n8n Code Node (mode: Run Once for Each Item)
const lead = $input.item.json;
let score = 0;
// Source quality
const sourceScores = {
  'demo-request': 30,
  'pricing-page': 25,
  'webinar': 20,
  'content-download': 10,
  'newsletter': 5
};
score += sourceScores[lead.source] || 0;
// Company size indicators (from domain)
// checkCompanySize is a placeholder for your own lookup; it is not an n8n built-in
if (lead.domain && await checkCompanySize(lead.domain) === 'enterprise') {
  score += 30;
}
// Behavioral signals (timeOnSite is assumed to be in seconds)
if (lead.pageViews > 5) score += 15;
if (lead.timeOnSite > 300) score += 10;
// Enrichment decision (the maximum possible score is 85)
let enrichmentTier = 'skip';
if (score >= 70) enrichmentTier = 'tier3'; // Premium
else if (score >= 40) enrichmentTier = 'tier2'; // Budget
else if (score >= 20) enrichmentTier = 'tier1'; // Free
return { ...lead, enrichmentScore: score, enrichmentTier: enrichmentTier };
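To make the thresholds concrete, the scoring rules above can be pulled into pure functions and run against sample leads. This is a sketch under stated assumptions: scoreLead and tierForScore are names introduced here, the company size is passed in rather than looked up, and the sample leads are invented.

```javascript
// Source weights copied from the lead scoring node above.
const SOURCE_SCORES = {
  'demo-request': 30,
  'pricing-page': 25,
  'webinar': 20,
  'content-download': 10,
  'newsletter': 5
};

// companySize is passed in so the function stays synchronous and testable;
// in the workflow it would come from a lookup such as checkCompanySize.
function scoreLead(lead, companySize) {
  let score = SOURCE_SCORES[lead.source] || 0;
  if (companySize === 'enterprise') score += 30;
  if (lead.pageViews > 5) score += 15;
  if (lead.timeOnSite > 300) score += 10;
  return score;
}

function tierForScore(score) {
  if (score >= 70) return 'tier3'; // Premium
  if (score >= 40) return 'tier2'; // Budget
  if (score >= 20) return 'tier1'; // Free
  return 'skip';
}

// Enterprise demo request with 8 page views: 30 + 30 + 15 = 75
const enterpriseDemo = scoreLead(
  { source: 'demo-request', pageViews: 8, timeOnSite: 120 }, 'enterprise');
// Newsletter signup that stayed 400 seconds: 5 + 10 = 15
const newsletterSignup = scoreLead(
  { source: 'newsletter', pageViews: 2, timeOnSite: 400 }, 'smb');

console.log(enterpriseDemo, tierForScore(enterpriseDemo));
console.log(newsletterSignup, tierForScore(newsletterSignup));
```

With these weights a lead cannot reach the premium tier on source and behavior alone (30 + 15 + 10 = 55), so the enterprise signal is effectively required for Tier 3.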
Module 2: Deduplication and Cache Checking
Never pay to enrich the same domain twice:
Check Cache Node (MySQL/PostgreSQL):
-- MySQL syntax; in PostgreSQL write NOW() - INTERVAL '30 days'
SELECT * FROM enrichment_cache
WHERE domain = '{{$json.domain}}'
AND updated_at >= NOW() - INTERVAL 30 DAY
LIMIT 1;
Cache Logic (IF Node):
If cache found and < 30 days old:
→ Use cached data
Else:
→ Proceed to enrichment waterfall
→ Store results in cache after enrichment
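The same freshness decision can also be made in a Code node, which is useful when the TTL varies by tier. The following is a minimal sketch: isCacheFresh and resolveFromCache are hypothetical helpers, and updated_at is assumed to be the timestamp column from the cache table.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the cached row was updated less than ttlDays ago.
// `now` is injectable so the function can be tested with fixed dates.
function isCacheFresh(updatedAt, ttlDays = 30, now = new Date()) {
  const updated = new Date(updatedAt).getTime();
  if (Number.isNaN(updated)) return false; // missing or unparseable timestamp
  return now.getTime() - updated < ttlDays * MS_PER_DAY;
}

// Decide between cached data and the enrichment waterfall.
function resolveFromCache(cacheRow, ttlDays = 30, now = new Date()) {
  if (cacheRow && isCacheFresh(cacheRow.updated_at, ttlDays, now)) {
    return { useCache: true, data: cacheRow };
  }
  return { useCache: false, data: null };
}

// Example with fixed dates: 15 days old is fresh, 60 days old is stale.
const now = new Date('2024-03-01T00:00:00Z');
const fresh = resolveFromCache({ domain: 'example.com', updated_at: '2024-02-15T00:00:00Z' }, 30, now);
const stale = resolveFromCache({ domain: 'example.com', updated_at: '2024-01-01T00:00:00Z' }, 30, now);
console.log(fresh.useCache, stale.useCache);
```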
Cache Table Schema:
CREATE TABLE enrichment_cache (
id INT PRIMARY KEY AUTO_INCREMENT,
domain VARCHAR(255) UNIQUE,
company_name VARCHAR(255),
employee_count INT,
industry VARCHAR(100),
revenue_range VARCHAR(50),
technologies JSON,
enrichment_tier VARCHAR(20),
cost_usd DECIMAL(10,4),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
Tier 1: Free Public Sources
Extract maximum value from free data sources before spending money:
Company Website Scraping
HTTP Request Node - Fetch Company Website:
{
"method": "GET",
"url": "https://{{$json.domain}}",
"options": {
"timeout": 10000,
"redirect": {
"followRedirects": true
}
}
}
Parse Metadata (Function Node):
// n8n Code Node (self-hosted: allow cheerio via NODE_FUNCTION_ALLOW_EXTERNAL=cheerio)
const html = $input.item.json.data;
const cheerio = require('cheerio');
const $ = cheerio.load(html);
// Extract company info from metadata
const companyName = $('meta[property="og:site_name"]').attr('content') ||
$('title').text().split('|')[0].trim();
const description = $('meta[name="description"]').attr('content');
// Look for employee count hints (allow thousands separators such as "1,200")
const aboutText = $('body').text();
const employeeMatch = aboutText.match(/(\d[\d,]*)\+?\s*(employees|team members)/i);
const employeeCount = employeeMatch ? parseInt(employeeMatch[1].replace(/,/g, ''), 10) : null;
// Extract social media links
const linkedinUrl = $('a[href*="linkedin.com"]').attr('href');
const twitterUrl = $('a[href*="twitter.com"]').attr('href');
// Technology detection
const technologies = [];
if (html.includes('Shopify.analytics')) technologies.push('Shopify');
if (html.includes('google-analytics.com')) technologies.push('Google Analytics');
if ($('script[src*="hubspot"]').length) technologies.push('HubSpot');
return {
companyName,
description,
employeeCount,
socialLinks: { linkedin: linkedinUrl, twitter: twitterUrl },
technologies,
dataSource: 'website_scrape'
};
DNS and Domain Intelligence
DNS Lookup Node:
// n8n Code Node using Node's dns module
// (self-hosted: allow it via NODE_FUNCTION_ALLOW_BUILTIN=dns)
const dns = require('dns').promises;
const domain = $input.item.json.domain;
// Lookups reject for domains without MX or TXT records, so fall back to empty arrays
const [mxRecords, txtRecords] = await Promise.all([
  dns.resolveMx(domain).catch(() => []),
  dns.resolveTxt(domain).catch(() => [])
]);
// Detect email provider from the preferred MX host (lowest priority number)
mxRecords.sort((a, b) => a.priority - b.priority);
const mxHost = (mxRecords[0]?.exchange || '').toLowerCase();
const emailProvider = mxHost.includes('google') ? 'Google Workspace' :
  mxHost.includes('outlook') ? 'Microsoft 365' : // covers *.mail.protection.outlook.com
  'Other';
// Extract SPF record for tech stack hints
// Each TXT record arrives as an array of string chunks; join them before searching
const spfRecord = txtRecords
  .map(chunks => chunks.join(''))
  .find(record => record.startsWith('v=spf1')) || '';
const techStack = [];
if (spfRecord.includes('hubspot')) techStack.push('HubSpot');
if (spfRecord.includes('salesforce')) techStack.push('Salesforce');
if (spfRecord.includes('sendgrid')) techStack.push('SendGrid');
return {
  emailProvider,
  mxRecords,
  techStack,
  dataSource: 'dns_lookup'
};
LinkedIn Company Scraping (Public Data)
HTTP Request to LinkedIn:
{
"method": "GET",
"url": "https://www.linkedin.com/company/{{$json.companyLinkedinSlug}}",
"options": {
"headers": {
"User-Agent": "Mozilla/5.0..."
}
}
}
Parse LinkedIn Data:
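// n8n Code Node (self-hosted: allow cheerio via NODE_FUNCTION_ALLOW_EXTERNAL=cheerio)
const cheerio = require('cheerio');
const html = $input.item.json.data;
// Extract visible public data (no authentication required)
// The CSS class names below are assumptions about LinkedIn's markup; verify them before relying on the output
const $ = cheerio.load(html);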
const companySize = $('.org-top-card-summary-info-list__info-item')
.filter((i, el) => $(el).text().includes('employees'))
.text()
.match(/[\d,]+-?[\d,]*/)?.[0];
const industry = $('.org-top-card-summary-info-list__info-item')
.first()
.text()
.trim();
const followerCount = $('.org-top-card-secondary-content__follower-count')
.text()
.match(/[\d,]+/)?.[0];
return { companySize, industry, followerCount, dataSource: 'linkedin_public' };
Tier 2: Budget Enrichment APIs
When free sources don’t provide enough data, escalate to affordable APIs:
Hunter.io for Email Finding
Hunter.io HTTP Node:
{
"method": "GET",
"url": "https://api.hunter.io/v2/domain-search",
"qs": {
"domain": "{{$json.domain}}",
"api_key": "{{$credentials.hunter_api_key}}",
"limit": 10
}
}
Cost Tracking:
// Log API usage for cost monitoring
const hunterCost = 0.04; // $0.04 per domain search
// Custom execution data is saved with the execution; values must be strings
$execution.customData.setAll({
  enrichmentProvider: 'Hunter.io',
  cost: String(hunterCost),
  recordsFound: String($input.item.json.meta.results)
});
// Update running cost total in workflow static data
// (persists between production executions, not manual test runs)
const staticData = $getWorkflowStaticData('global');
staticData.totalEnrichmentCost = (staticData.totalEnrichmentCost || 0) + hunterCost;
return $input.item;
FullContact for Social Profile Enrichment
FullContact Person API:
{
"method": "POST",
"url": "https://api.fullcontact.com/v3/person.enrich",
"headers": {
"Authorization": "Bearer {{$credentials.fullcontact_token}}"
},
"body": {
"email": "{{$json.email}}"
}
}
Tier 3: Premium Enrichment
Reserve expensive APIs for high-value leads only:
Clearbit Enrichment
Clearbit Company API Node:
{
"method": "GET",
"url": "https://company-stream.clearbit.com/v2/companies/find",
"qs": {
"domain": "{{$json.domain}}"
},
"headers": {
"Authorization": "Bearer {{$credentials.clearbit_key}}"
}
}
Selective Enrichment Logic:
// n8n Code Node (mode: Run Once for All Items)
// Only call Clearbit if:
// 1. Lead score >= 70
// 2. Free + budget sources returned insufficient data
// 3. Monthly Clearbit budget not exceeded
const staticData = $getWorkflowStaticData('global');
const clearbitSpend = staticData.clearbitSpendThisMonth || 0; // 0 before the first paid call
const monthlyBudget = 500; // USD
// Items that pass proceed to Clearbit; the rest skip expensive enrichment
return $input.all().filter(item =>
  item.json.enrichmentScore >= 70 &&
  !item.json.employeeCount && // Missing critical data
  clearbitSpend < monthlyBudget // Under budget
);
Intelligent Waterfall Logic
The magic is in the conditional flow between tiers:
n8n Switch Node Pattern (rules are evaluated top to bottom; the first match wins):
Tier 1 Complete → Evaluate Completeness
├─ If data complete: Skip to CRM Update
├─ If data partial & score >= 70: Skip to Tier 3
├─ If data partial & score 40-69: Proceed to Tier 2
└─ Otherwise: CRM Update with partial data
Tier 2 Complete → Evaluate Completeness
├─ If data complete: Skip to CRM Update
├─ If data partial & score >= 70: Proceed to Tier 3
└─ Otherwise: CRM Update with partial data
Tier 3 Complete → Proceed to CRM Update
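The routing above can also be expressed as one small function, which is easier to test than a chain of Switch rules. This is a sketch: nextStep is a hypothetical name, the higher score threshold is checked first so the two score conditions cannot both match, and sending leads that stay incomplete below the thresholds to the CRM with partial data is an assumption.

```javascript
// Returns the next stage for a lead that has just finished `currentTier`.
// dataComplete: boolean result of the completeness check
// score: the lead's enrichment score
function nextStep(currentTier, dataComplete, score) {
  if (dataComplete || currentTier === 'tier3') return 'crm';
  if (currentTier === 'tier1') {
    if (score >= 70) return 'tier3'; // check the higher threshold first
    if (score >= 40) return 'tier2';
    return 'crm'; // assumption: low scorers go to the CRM with partial data
  }
  if (currentTier === 'tier2') {
    return score >= 70 ? 'tier3' : 'crm';
  }
  return 'crm';
}

console.log(nextStep('tier1', false, 75)); // tier3
console.log(nextStep('tier1', false, 55)); // tier2
console.log(nextStep('tier1', true, 90));  // crm
console.log(nextStep('tier2', false, 55)); // crm
```

In a workflow, the returned string could drive a single Switch node placed after each completeness check.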
Data Completeness Function:
// Calculate data completeness percentage
const requiredFields = [
'companyName', 'industry', 'employeeCount',
'revenue', 'location', 'technologies'
];
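// An empty array (e.g. no technologies detected) counts as missing
const isFilled = value =>
  value !== null && value !== undefined && value !== '' &&
  !(Array.isArray(value) && value.length === 0);
const completedFields = requiredFields.filter(field => isFilled($json[field]));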
const completeness = (completedFields.length / requiredFields.length) * 100;
return {
...$json,
dataCompleteness: completeness,
shouldContinueEnrichment: completeness < 80 // Continue if <80% complete
};
Rate Limiting and Cost Controls
Prevent runaway API costs with built-in governors:
Rate Limit Implementation
Rate Limiter Function Node:
// n8n Code Node (mode: Run Once for All Items)
// Counters live in workflow static data so they survive between executions
const state = $getWorkflowStaticData('global');
const now = Date.now();
const windowDuration = 60000; // 1 minute
const maxRequests = 100; // e.g., 100 requests per minute
// Start a new window on first use or once the current one has expired
if (!state.rateLimitWindowStart || now - state.rateLimitWindowStart > windowDuration) {
  state.rateLimitWindowStart = now;
  state.rateLimitCount = 0;
}
// Allow only as many items as the current window has room for
const items = $input.all();
const remaining = Math.max(0, maxRequests - state.rateLimitCount);
state.rateLimitCount += Math.min(items.length, remaining);
const retryAfterMs = windowDuration - (now - state.rateLimitWindowStart);
// Flag items over the limit; route rateLimited items to a Wait node for retry
return items.map((item, index) => ({
  json: {
    ...item.json,
    rateLimited: index >= remaining,
    retryAfterMs: index >= remaining ? retryAfterMs : 0
  }
}));
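The limiter above counts requests in a fixed one-minute window, which can let a burst through at each window boundary. Where smoother throttling matters, a token bucket is a common alternative. The following is a minimal sketch in plain JavaScript: createTokenBucket and tryConsume are hypothetical names, and the bucket is a plain object so it could be stored in workflow static data.

```javascript
// A bucket holds up to `capacity` tokens and refills continuously at
// `refillPerSecond`. Each request consumes one token.
function createTokenBucket(capacity, refillPerSecond, nowMs = Date.now()) {
  return { capacity, refillPerSecond, tokens: capacity, lastRefillMs: nowMs };
}

// Refill based on elapsed time, then try to take one token.
// Returns true if the request may proceed.
function tryConsume(bucket, nowMs = Date.now()) {
  const elapsedSeconds = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
  bucket.tokens = Math.min(
    bucket.capacity,
    bucket.tokens + elapsedSeconds * bucket.refillPerSecond
  );
  bucket.lastRefillMs = nowMs;
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true;
  }
  return false;
}

// Example with an injected clock: capacity 2, refilling 1 token per second.
const bucket = createTokenBucket(2, 1, 0);
const results = [
  tryConsume(bucket, 0),    // true  (2 -> 1)
  tryConsume(bucket, 0),    // true  (1 -> 0)
  tryConsume(bucket, 0),    // false (empty)
  tryConsume(bucket, 1000), // true  (refilled 1 token after 1 second)
  tryConsume(bucket, 1500)  // false (only 0.5 tokens available)
];
console.log(results);
```

Keeping one bucket per provider, keyed by provider name, lets each API's limit be tracked separately.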
Daily Budget Caps
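Cost Control Node:
// n8n Code Node (mode: Run Once for All Items)
// Track daily enrichment spend in workflow static data, keyed by date.
// Nodes that call paid APIs are expected to add their cost to this key.
const state = $getWorkflowStaticData('global');
const today = new Date().toISOString().split('T')[0];
const dailyBudgetKey = `enrichment_cost_${today}`;
const dailySpend = Number(state[dailyBudgetKey]) || 0;
const dailyBudgetLimit = 100; // $100 per day
const budgetExceeded = dailySpend >= dailyBudgetLimit;
// Flag every item; a downstream IF node blocks enrichment when budgetExceeded
// is true (defer to tomorrow or queue for approval) and a Slack node can send
// "Enrichment budget limit reached: $<dailySpend>/$<dailyBudgetLimit>"
return $input.all().map(item => ({
  json: { ...item.json, budgetExceeded, dailySpend, dailyBudgetLimit }
}));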
CRM Integration and Data Handoff
HubSpot Update Node
Update Contact with Enriched Data:
{
"module": "HubSpot - Update Contact",
"email": "{{$json.email}}",
"properties": {
"company": "{{$json.companyName}}",
"industry": "{{$json.industry}}",
"numberofemployees": "{{$json.employeeCount}}",
"annualrevenue": "{{$json.revenue}}",
"website": "{{$json.domain}}",
"data_enrichment_date": "{{$now}}",
"data_enrichment_tier": "{{$json.enrichmentTier}}",
"data_enrichment_cost": "{{$json.totalEnrichmentCost}}",
"data_completeness_score": "{{$json.dataCompleteness}}",
"technologies": "{{$json.technologies.join(', ')}}"
}
}
Salesforce Update with Field Mapping
// Map enriched data to Salesforce fields
const salesforceMapping = {
'Company': $json.companyName,
'NumberOfEmployees': $json.employeeCount,
'Industry': $json.industry,
'AnnualRevenue': parseRevenue($json.revenue),
'Website': $json.domain,
'Enrichment_Date__c': new Date().toISOString(),
'Enrichment_Tier__c': $json.enrichmentTier,
'Enrichment_Cost__c': $json.totalEnrichmentCost,
'Data_Quality_Score__c': $json.dataCompleteness
};
return { salesforceMapping };
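The mapping above calls parseRevenue, which is not defined in the snippet. Salesforce's AnnualRevenue is a currency field, so the value has to be numeric. One possible implementation is sketched below; the accepted input formats and the choice to take the lower bound of a range such as "$10M-$50M" are assumptions, so adjust them to whatever your providers return.

```javascript
// Hypothetical helper: convert revenue strings such as "$10M", "$1.5B",
// "250K", "1,200,000" or "$10M-$50M" into a number of dollars.
// For ranges, the first (lower) figure is used. Returns null when no
// number can be found.
function parseRevenue(value) {
  if (typeof value === 'number') return Number.isFinite(value) ? value : null;
  if (typeof value !== 'string') return null;

  const multipliers = { K: 1e3, M: 1e6, B: 1e9 };
  // Drop thousands separators, then capture the first number and optional suffix
  const match = value.replace(/,/g, '').match(/(\d+(?:\.\d+)?)\s*([KMB])?/i);
  if (!match) return null;

  const unit = (match[2] || '').toUpperCase();
  return Math.round(parseFloat(match[1]) * (multipliers[unit] || 1));
}

console.log(parseRevenue('$10M-$50M')); // 10000000
console.log(parseRevenue('$1.5B'));     // 1500000000
console.log(parseRevenue('250K'));      // 250000
console.log(parseRevenue('unknown'));   // null
```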
Monitoring and Optimization
Enrichment Analytics Dashboard
Track these key metrics in your n8n database:
CREATE TABLE enrichment_metrics (
id INT AUTO_INCREMENT PRIMARY KEY,
date DATE,
total_leads_processed INT,
tier0_validated INT,
tier1_enriched INT,
tier2_enriched INT,
tier3_enriched INT,
skipped_low_quality INT,
total_cost_usd DECIMAL(10,2),
avg_completeness_score DECIMAL(5,2),
avg_cost_per_lead DECIMAL(10,4),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Daily Metrics Aggregation Workflow (run daily: Schedule Trigger → MySQL node → Code node → Slack node):
MySQL Node Query:
SELECT
COUNT(*) AS total_leads,
SUM(CASE WHEN tier = 'tier1' THEN 1 ELSE 0 END) AS tier1_count,
SUM(CASE WHEN tier = 'tier2' THEN 1 ELSE 0 END) AS tier2_count,
SUM(CASE WHEN tier = 'tier3' THEN 1 ELSE 0 END) AS tier3_count,
SUM(cost) AS total_cost,
AVG(completeness) AS avg_completeness
FROM enrichment_logs
WHERE DATE(created_at) = CURDATE() - INTERVAL 1 DAY;
Format Report (Code Node):
// Build the Slack message from the single row returned by the query
const row = $input.first().json;
const yesterday = new Date();
yesterday.setDate(yesterday.getDate() - 1);
// SUM() and AVG() return NULL when no rows match, so default to 0
const totalLeads = Number(row.total_leads) || 0;
const totalCost = Number(row.total_cost) || 0;
const avgCompleteness = Number(row.avg_completeness) || 0;
const avgCostPerLead = totalLeads > 0 ? totalCost / totalLeads : 0; // avoid dividing by zero
// Pass channel and text to a Slack node to send the daily report
return {
  channel: '#revops-metrics',
  text: `📊 Enrichment Report - ${yesterday.toDateString()}
Total Processed: ${totalLeads}
Free Sources: ${Number(row.tier1_count) || 0}
Budget APIs: ${Number(row.tier2_count) || 0}
Premium APIs: ${Number(row.tier3_count) || 0}
Total Cost: $${totalCost.toFixed(2)}
Avg Cost/Lead: $${avgCostPerLead.toFixed(2)}
Avg Data Quality: ${avgCompleteness.toFixed(1)}%`
};
Advanced Patterns
Batch Processing for Cost Efficiency
Some APIs offer batch discounts:
// n8n Code Node (mode: Run Once for All Items)
// Accumulate leads in a buffer (workflow static data) until batch size reached
const state = $getWorkflowStaticData('global');
const batchSize = 25;
const currentBatch = state.enrichmentBatch || [];
currentBatch.push(...$input.all().map(item => item.json));
if (currentBatch.length >= batchSize) {
  // Process batch: reset the buffer and release the leads as items
  state.enrichmentBatch = [];
  return currentBatch.map(json => ({ json })); // Send entire batch to next node
} else {
  // Wait for more leads
  state.enrichmentBatch = currentBatch;
  return []; // Don't proceed yet
}
Conditional Re-Enrichment
Automatically refresh stale data:
// n8n Code Node (mode: Run Once for All Items)
const MS_PER_DAY = 1000 * 60 * 60 * 24;
// Re-enrich if:
// - Data is >90 days old AND lead recently engaged
// - Data is >180 days old (any lead)
// - Lead score increased significantly
// Items that pass proceed to enrichment; the rest skip re-enrichment
return $input.all().filter(({ json }) => {
  // Check when lead was last enriched
  const lastEnrichment = new Date(json.enrichmentDate).getTime();
  // A missing or invalid date means the lead was never enriched
  if (Number.isNaN(lastEnrichment)) return true;
  const daysSinceEnrichment = (Date.now() - lastEnrichment) / MS_PER_DAY;
  return (daysSinceEnrichment > 90 && json.recentEngagement) ||
    daysSinceEnrichment > 180 ||
    (json.currentScore - json.scoreAtEnrichment > 30);
});
FAQ
Q: How do I decide which leads deserve expensive enrichment? A: Use a scoring model that considers source quality, behavioral signals, and company size indicators. Only send leads scoring 70+ to premium APIs. For B2B SaaS, demo requests from corporate domains warrant premium enrichment, while ebook downloads from personal emails don’t.
Q: What’s a reasonable enrichment budget for a startup? A: Start with $200-500/month and adjust based on lead volume and conversion rates. Calculate your cost per SQL (Sales Qualified Lead) including enrichment costs. If enrichment adds $2 per lead but increases qualification accuracy by 40%, that’s usually a good trade.
Q: How do I handle API rate limits across multiple providers? A: Implement a fixed-window counter or token bucket in n8n using workflow static data ($getWorkflowStaticData), which persists between production executions. Track request counts per provider separately. When approaching limits, either queue requests for the next window or switch to an alternative provider for that data point.
Q: Should I enrich leads immediately or wait until they show engagement? A: Use a hybrid approach: do lightweight Tier 1 enrichment (free sources) immediately for scoring and routing. Defer expensive Tier 2-3 enrichment until leads hit engagement thresholds (email opens, multiple page visits, demo requests).
Q: How do I prevent duplicate enrichment costs? A: Implement domain-level caching with 30-90 day TTL. Before enriching any lead, check if you’ve enriched that domain recently. For companies with multiple contacts, enrich once at the domain level and apply firmographic data to all contacts.
Q: What’s the best way to measure enrichment ROI? A: Track conversion rates by enrichment tier. Calculate: (Additional revenue from enriched leads - Enrichment costs) / Enrichment costs. Also measure time saved for sales reps—if enrichment reduces research time by 15 minutes per lead, that’s quantifiable labor savings.
Q: How do I handle enrichment for international leads? A: Many US-centric enrichment APIs have poor coverage for EMEA/APAC. Build region-specific waterfalls: use Crunchbase for EU startups, use Owler for APAC companies. Always start with free website scraping—it works globally.
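The ROI formula and the labor-savings estimate from the FAQ above can be written out as two small functions. This is a sketch with invented figures: the function names, the revenue and cost numbers, and the $40 hourly rate are all assumptions used only to show the arithmetic.

```javascript
// ROI = (additional revenue - enrichment costs) / enrichment costs
// Returns null when there is no cost to compare against.
function enrichmentRoi(additionalRevenue, enrichmentCost) {
  if (enrichmentCost <= 0) return null;
  return (additionalRevenue - enrichmentCost) / enrichmentCost;
}

// Labor savings = leads * minutes saved per lead, converted to hours,
// multiplied by the rep's hourly cost.
function laborSavings(leadCount, minutesSavedPerLead, hourlyRate) {
  return (leadCount * minutesSavedPerLead / 60) * hourlyRate;
}

// Hypothetical month: $5,000 additional revenue on $1,000 of enrichment spend
const roi = enrichmentRoi(5000, 1000); // (5000 - 1000) / 1000 = 4, i.e. 400%

// 200 leads * 15 minutes = 50 hours; at $40/hour that is $2,000
const savings = laborSavings(200, 15, 40);

console.log(`ROI: ${(roi * 100).toFixed(0)}%`);
console.log(`Labor savings: $${savings.toFixed(2)}`);
```

Running the same calculation per enrichment tier shows whether premium lookups pay for themselves.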
Progressive enrichment in n8n transforms enrichment from a cost center into a strategic advantage. Start with the free tiers, prove the value, then selectively invest in premium data where it actually drives revenue. Your CFO will love the cost efficiency, and your sales team will love the data quality.