Introduction
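A production retail intelligence pipeline typically has five stages: ingest, stage, transform, model, and serve. Product Data Scrape powers the ingestion layer for many enterprise data teams. This guide outlines the full architecture and shows example code for the ingestion and Silver stages.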
Five-Stage Pipeline Architecture
- INGEST: Product Data Scrape API → Raw zone (S3/GCS as JSON)
- STAGE: Raw → Bronze tables (typed, partitioned)
- TRANSFORM: Bronze → Silver (cleaned, deduplicated)
- MODEL: Silver → Gold (business-ready aggregations)
- SERVE: Gold → BI tools / APIs / downstream apps
Stage 1: Ingestion from Product Data Scrape
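import boto3
import pandas as pd
from datetime import datetime, timezone

s3 = boto3.client("s3")

async def ingest_from_product_data_scrape(urls):
    # Fetch from Product Data Scrape API.
    # pds_api is assumed to be an async API client configured elsewhere.
    results = await pds_api.batch_fetch(urls)

    # Partition the raw zone by UTC date and hour
    timestamp = datetime.now(timezone.utc)
    key = (
        f"raw/products/dt={timestamp.date()}/hour={timestamp.hour:02d}/"
        f"{timestamp.timestamp()}.parquet"
    )

    # to_parquet() with no path returns the file contents as bytes
    s3.put_object(
        Bucket="my-data-lake",
        Key=key,
        Body=pd.DataFrame(results).to_parquet(),
    )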
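The STAGE step (Raw → Bronze, "typed, partitioned") is not shown in this guide. As a rough sketch of what typing and partitioning one raw record might look like, assuming the field names from the sample row later in this guide: `BronzeProduct`, `to_float`, and `to_bronze` are hypothetical names for illustration, not part of any Product Data Scrape API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class BronzeProduct:
    # Hypothetical typed row; fields follow the sample row in this guide
    product_id: str
    retailer: str
    title: Optional[str]
    price_current: Optional[float]
    price_msrp: Optional[float]
    scraped_at: datetime
    dt: str  # partition key, e.g. "2026-06-09"


def to_float(value) -> Optional[float]:
    """Coerce a raw value to float; return None instead of failing the batch."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def to_bronze(record: dict) -> BronzeProduct:
    """Cast one raw record to typed fields and derive its dt partition."""
    # Older Python versions reject a trailing "Z" in fromisoformat
    scraped_at = datetime.fromisoformat(
        record["scraped_at"].replace("Z", "+00:00")
    ).astimezone(timezone.utc)
    return BronzeProduct(
        product_id=str(record["product_id"]),
        retailer=str(record["retailer"]),
        title=record.get("title"),
        price_current=to_float(record.get("price_current")),
        price_msrp=to_float(record.get("price_msrp")),
        scraped_at=scraped_at,
        dt=scraped_at.date().isoformat(),
    )
```

Unparseable prices become None here rather than raising, so a single malformed record does not block the whole load; whether that is the right trade-off depends on your quality checks downstream.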
Stage 3: Silver Layer (Cleaned)
-- dbt model: silver_products.sql
-- Keeps the latest scrape per (product_id, retailer).
-- No trailing semicolon: dbt wraps the model's SELECT in its own statement.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY product_id, retailer
            ORDER BY scraped_at DESC
        ) AS recency_rank
    FROM {{ ref('bronze_products') }}
)
SELECT
    product_id,
    retailer,
    REGEXP_REPLACE(title, '[🔥⭐✨]', '') AS title,  -- strip decorative emoji
    INITCAP(brand) AS brand_normalized,
    price_current,
    CASE
        WHEN price_msrp > 0
            THEN ROUND((price_msrp - price_current) / price_msrp * 100, 1)
        ELSE 0
    END AS discount_pct,
    availability = 'in_stock' AS is_available,
    scraped_at
FROM ranked
WHERE recency_rank = 1
  AND title IS NOT NULL
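To sanity-check the Silver logic on a handful of rows before running it in the warehouse, the deduplication and discount rules above can be mirrored in plain Python. This is only an illustrative sketch: `latest_per_product` and `discount_pct` are hypothetical helper names, and rounding of exact ties may differ slightly from your warehouse's ROUND.

```python
def latest_per_product(rows):
    """Keep the most recent row per (product_id, retailer), like recency_rank = 1."""
    latest = {}
    for row in rows:
        key = (row["product_id"], row["retailer"])
        # ISO-8601 UTC strings in one consistent format sort correctly as text
        if key not in latest or row["scraped_at"] > latest[key]["scraped_at"]:
            latest[key] = row
    # Mirror the model's filter: drop winners whose title is NULL
    return [r for r in latest.values() if r.get("title") is not None]


def discount_pct(price_current, price_msrp):
    """Percent off MSRP, rounded to 1 decimal; 0 when MSRP is missing or not positive."""
    if price_msrp is None or price_msrp <= 0:
        return 0
    if price_current is None:
        return None  # SQL arithmetic on NULL yields NULL
    return round((price_msrp - price_current) / price_msrp * 100, 1)
```

For the sample row in this guide (current price 44.99, MSRP 49.99), `discount_pct(44.99, 49.99)` gives 10.0.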
Sample Bronze Table from Product Data Scrape Ingestion
{
  "table": "bronze.products",
  "partition": "dt=2026-06-09",
  "schema_version": "v2.4",
  "row_count": 8945721,
  "size_gb": 12.4,
  "sample_row": {
    "product_id": "B0CHX1W1XY",
    "retailer": "amazon_us",
    "title": "Echo Dot (5th Gen) Smart Speaker",
    "brand": "Amazon",
    "price_current": 44.99,
    "price_msrp": 49.99,
    "currency": "USD",
    "availability": "in_stock",
    "rating": 4.6,
    "reviews_count": 142891,
    "scraped_at": "2026-06-09T10:23:00Z",
    "data_source": "product_data_scrape",
    "ingested_at": "2026-06-09T10:25:14Z",
    "bronze_load_id": "load_2026_06_09_10"
  },
  "quality_checks": {
    "row_count_check": "passed",
    "freshness_check": "passed",
    "null_rate_check": "passed",
    "price_sanity_check": "passed"
  }
}
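The guide does not define how the four entries under quality_checks are computed. One plausible reading, sketched below, treats each as a simple threshold test over a batch of Bronze rows. The function name `run_quality_checks` and every threshold (minimum row count, 24-hour freshness window, 5% null rate on title, positive prices) are assumptions for illustration, not Product Data Scrape's actual rules.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    # Accept the trailing "Z" used in the sample row
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def run_quality_checks(rows, now, min_rows=1,
                       max_age=timedelta(hours=24), max_null_rate=0.05):
    """Return {check_name: "passed" or "failed"} for a batch of Bronze rows.

    All thresholds are illustrative assumptions.
    """
    def status(ok):
        return "passed" if ok else "failed"

    n = len(rows)
    # Newest scrape time in the batch, or None for an empty batch
    newest = max((parse_ts(r["scraped_at"]) for r in rows), default=None)
    null_titles = sum(1 for r in rows if r.get("title") is None)
    prices_ok = all(
        r.get("price_current") is not None and r["price_current"] > 0
        for r in rows
    )
    return {
        "row_count_check": status(n >= min_rows),
        "freshness_check": status(newest is not None and now - newest <= max_age),
        "null_rate_check": status(n > 0 and null_titles / n <= max_null_rate),
        "price_sanity_check": status(n > 0 and prices_ok),
    }
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the freshness check deterministic and easy to test.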
How Product Data Scrape Helps
We deliver data directly to your S3 bucket, BigQuery dataset, or Snowflake table, already cleaned, deduplicated, and QA-passed, so you can skip the Bronze and Silver layers entirely.
Get datasets delivered to your warehouse from Product Data Scrape
Contact Us Today!
About Product Data Scrape
Product Data Scrape is the leading provider of managed web scraping services and ready-to-use product datasets. We help 200+ brands, retailers, and AI companies turn the messy public web into clean, structured product data.
Our Services:
- Web Scraping API: REST API for developers (1,000 free credits)
- Scraper as a Service: custom scrapers built in 7-10 days
- Ready Datasets: 100+ pre-built datasets, free 1,000-row samples in 24 hours
Contact:
- Website: https://www.productdatascrape.com
- Email: sales@productdatascrape.com