Automated Data Pipeline — Proptech Industry
Replaced a 2-week manual Excel process with a serverless Apache Beam pipeline on Google Cloud Dataflow. Runs quarterly in 6 hours, unattended.
The Problem
A company in the proptech space needed to aggregate and grade professional records from public listing APIs across the United States. Each cycle involved pulling millions of records, scoring individuals against configurable thresholds, and exporting enriched data for their sales team.
The existing process was entirely manual — a team member spent two weeks per cycle working through spreadsheets, paginating APIs by hand, and calculating grades in Excel. The process ran quarterly, and each cycle was error-prone and exhausting.
What We Built
Phase 1 — Prototype
We started with a custom Python application that dynamically paginated through the API, aggregating records per individual until configurable thresholds were met (e.g., $40M total volume for the top tier). This prototype validated the data model and API integration patterns, handling rate limiting, exponential backoff retries, and malformed response recovery.
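The pagination strategy can be sketched as follows. This is a minimal illustration, not the client's actual code: `fetch_with_backoff`, `aggregate_until_threshold`, and the page-depth limit are hypothetical names, and the $40M figure is the top-tier threshold mentioned above.

```python
import time

TIER_THRESHOLD = 40_000_000  # e.g. $40M total volume for the top tier
MAX_RETRIES = 3

def fetch_with_backoff(fetch_page, page, retries=MAX_RETRIES, base_delay=1.0):
    """Retry transient API failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch_page(page)
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def aggregate_until_threshold(fetch_page, threshold=TIER_THRESHOLD, max_pages=19):
    """Paginate per individual, stopping once the tier threshold is met."""
    total, records = 0, []
    for page in range(max_pages):
        batch = fetch_with_backoff(fetch_page, page)
        if not batch:            # empty page: no more records for this person
            break
        records.extend(batch)
        total += sum(r["volume"] for r in batch)
        if total >= threshold:   # configurable cutoff met: stop early
            break
    return total, records
```

Stopping at the threshold rather than exhausting every page is what kept API usage (and runtime) proportional to the grading criteria instead of the full record history.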
Phase 2 — Production Pipeline
The production system was rebuilt on Apache Beam, deployed as a Google Cloud Dataflow template. This gave the client a managed, serverless pipeline that auto-scales workers based on data volume — no infrastructure to maintain.
Pipeline Architecture: 3-stage fan-out design. Stage 1 fetches profile data and calculates initial scores. Stages 2 and 3 paginate through active and historical records respectively, with pagination loops unrolled up to 19 pages deep and circuit breakers that trip after consecutive failures. All stages write raw API responses to BigQuery for audit and reprocessing.
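The consecutive-failure circuit breaker in the pagination stages behaves roughly like this. A pure-Python sketch with illustrative names and an assumed trip threshold of 3; in production this logic lives inside a Beam transform rather than a plain loop.

```python
MAX_PAGES = 19                 # pagination unrolled to this depth
MAX_CONSECUTIVE_FAILURES = 3   # assumed trip threshold (illustrative)

def paginate_with_breaker(fetch, max_pages=MAX_PAGES,
                          max_failures=MAX_CONSECUTIVE_FAILURES):
    """Walk pages in order; abort the stage after too many consecutive failures."""
    results, failures = [], 0
    for page in range(max_pages):
        try:
            batch = fetch(page)
        except Exception:
            failures += 1
            if failures >= max_failures:  # circuit trips: give up on this stage
                break
            continue
        failures = 0                      # any success resets the counter
        if not batch:
            break                         # no more records
        results.extend(batch)
    return results
```

The breaker distinguishes a flaky API (isolated failures, counter resets on the next success) from a broken one (consecutive failures, stage aborts), so one bad upstream endpoint can't stall an entire quarterly run.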
Phase 3 — Self-Service Dashboard
A Flask-based web dashboard lets the client's team trigger pipeline runs with configurable parameters: grade thresholds, date cutoffs, whether to include active records, and which tiers to process. The dashboard writes batch tracking records to BigQuery and monitors pipeline status — the team runs it themselves without engineering support.
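A trigger endpoint along these lines captures the shape of the dashboard's run configuration. This is a hedged sketch: the route, parameter names, and `launch_pipeline` stub are hypothetical, standing in for the real Dataflow template launch and BigQuery batch-tracking write.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def launch_pipeline(params):
    """Stub for the real launcher, which calls the Dataflow templates
    launch API and writes a batch-tracking row to BigQuery."""
    return {"job_id": "dataflow-job-123", "params": params}

@app.route("/runs", methods=["POST"])
def trigger_run():
    body = request.get_json(force=True)
    # Configurable run parameters, mirroring the dashboard's options.
    params = {
        "grade_threshold": body.get("grade_threshold", 40_000_000),
        "date_cutoff": body.get("date_cutoff"),
        "include_active": bool(body.get("include_active", True)),
        "tiers": body.get("tiers", ["top"]),
    }
    if params["date_cutoff"] is None:
        return jsonify({"error": "date_cutoff is required"}), 400
    job = launch_pipeline(params)
    return jsonify(job), 202
```

Validating parameters server-side before launching means a misconfigured run fails in seconds at the dashboard rather than hours into a Dataflow job.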
Results
The pipeline runs quarterly with zero engineering involvement. The client configures parameters through the dashboard, triggers the run, and receives graded export files in Cloud Storage. What used to consume a team member for two full weeks is now a 15-minute task followed by an automated 6-hour pipeline execution.
Tech Stack
| Layer | Technology | Why |
|---|---|---|
| Pipeline | Apache Beam | Portable, scalable batch processing |
| Runtime | Google Cloud Dataflow | Managed, serverless, auto-scaling |
| Storage | BigQuery + Cloud Storage | Audit trail + export files |
| Secrets | Secret Manager | API keys and credentials |
| Dashboard | Flask (Python) | Lightweight self-service UI |
| APIs | REST + RapidAPI | Public listing data sources |
