Proptech / Real Estate · 4 weeks · Jun 2025

Automated Data Pipeline — Proptech Industry

Replaced a 2-week manual Excel process with a serverless Apache Beam pipeline on Google Cloud Dataflow. Runs quarterly in 6 hours, unattended.

Data Engineering · Cloud Architecture (GCP) · API Integration · Self-Service Dashboard

Key Results

Time Reduction: 93%
Human Time/Cycle: 15 min
API Throughput: ~20/sec
Engineer Involvement: 0

The Problem

A company in the proptech space needed to aggregate and grade professional records from public listing APIs across the United States. Each cycle involved pulling millions of records, scoring individuals against configurable thresholds, and exporting enriched data for their sales team.
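As a rough illustration of the grading step, a tier can be assigned by comparing an individual's aggregated total against configurable cutoffs. In this sketch, only the $40M top-tier figure comes from the case study; the function name, lower cutoffs, and tier labels are hypothetical:

```python
def assign_tier(total_volume_usd,
                thresholds=((40_000_000, "A"), (10_000_000, "B"), (1_000_000, "C"))):
    """Map an individual's aggregated volume to a grade tier.

    `thresholds` is a (cutoff, tier) sequence sorted high to low. Only the
    $40M top-tier cutoff is from the case study; the rest are illustrative.
    """
    for cutoff, tier in thresholds:
        if total_volume_usd >= cutoff:
            return tier
    return "ungraded"
```

Because the thresholds are passed in rather than hard-coded, the same function serves every tier configuration the client selects per run.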

The existing process was entirely manual — a team member spent two weeks per cycle working through spreadsheets, paginating APIs by hand, and calculating grades in Excel. The process ran quarterly, and each cycle was error-prone and exhausting.

What We Built

Phase 1 — Prototype

We started with a custom Python application that dynamically paginated through the API, aggregating records per individual until configurable thresholds were met (e.g., $40M total volume for the top tier). This prototype validated the data model and API integration patterns, handling rate limiting, exponential backoff retries, and malformed response recovery.
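A minimal sketch of that dynamic pagination loop with exponential-backoff retries, assuming a `fetch_page` callable that stands in for the real API client (all names and the record shape are placeholders):

```python
import time

def paginate_until(fetch_page, threshold, max_retries=5, base_delay=1.0):
    """Pull pages and accumulate a per-individual total until `threshold`
    (e.g. $40M volume for the top tier) is met or the pages run out.

    `fetch_page(page)` is a stand-in for the real API call; it returns a
    list of records (each with a "volume" field) or raises on failure.
    """
    total, page = 0.0, 1
    while total < threshold:
        for attempt in range(max_retries):
            try:
                records = fetch_page(page)
                break
            except Exception:
                # Exponential backoff before retrying the same page.
                time.sleep(base_delay * 2 ** attempt)
        else:
            raise RuntimeError(f"page {page} failed after {max_retries} retries")
        if not records:  # no more pages: stop with whatever was aggregated
            break
        total += sum(r.get("volume", 0) for r in records)
        page += 1
    return total
```

Stopping as soon as the threshold is met avoids paginating deeper than the grading decision requires, which matters when each cycle touches millions of records.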

Phase 2 — Production Pipeline

The production system was rebuilt on Apache Beam, deployed as a Google Cloud Dataflow template. This gave the client a managed, serverless pipeline that auto-scales workers based on data volume — no infrastructure to maintain.

Pipeline Architecture: 3-stage fan-out design. Stage 1 fetches profile data and calculates initial scores. Stages 2 and 3 paginate through active and historical records respectively, with iterative loop unrolling (up to 19 pages deep) and consecutive-failure circuit breakers. All stages write raw API responses to BigQuery for audit and reprocessing.
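The consecutive-failure circuit breaker in stages 2 and 3 might look like the following sketch. The 19-page cap mirrors the pipeline's unrolling depth; everything else, including the `fetch_page` callable and the failure limit of 3, is an assumption:

```python
def paginate_with_breaker(fetch_page, max_pages=19, max_consecutive_failures=3):
    """Walk up to `max_pages` of records, tripping a circuit breaker after
    too many consecutive failures instead of hammering a degraded API.

    The 19-page cap comes from the case study; the failure limit is an
    illustrative choice.
    """
    results, consecutive_failures = [], 0
    for page in range(1, max_pages + 1):
        try:
            records = fetch_page(page)
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                break  # circuit open: give up on the remaining pages
            continue
        consecutive_failures = 0  # any success resets the breaker
        if not records:  # empty page marks the end of the data
            break
        results.extend(records)
    return results
```

Partial results still flow through to BigQuery, so a tripped breaker degrades a single individual's record set rather than failing the whole batch.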

Phase 3 — Self-Service Dashboard

A Flask-based web dashboard lets the client's team trigger pipeline runs with configurable parameters: grade thresholds, date cutoffs, whether to include active records, and which tiers to process. The dashboard writes batch tracking records to BigQuery and monitors pipeline status — the team runs it themselves without engineering support.
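Conceptually, the dashboard's trigger handler turns form inputs into two things: parameters for a Dataflow template launch and a batch tracking row for BigQuery. A minimal sketch of that shaping step, with field names and validation rules that are illustrative rather than the client's actual schema:

```python
import uuid
from datetime import datetime, timezone

def build_run_request(grade_threshold, date_cutoff, include_active, tiers):
    """Validate dashboard inputs and shape them into (a) launch parameters
    for a Dataflow template and (b) a batch tracking row for BigQuery.

    Field names are illustrative; the real dashboard's schema may differ.
    """
    if grade_threshold <= 0:
        raise ValueError("grade_threshold must be positive")
    if not tiers:
        raise ValueError("at least one tier must be selected")
    batch_id = str(uuid.uuid4())
    # Dataflow template parameters are passed as strings.
    template_params = {
        "gradeThreshold": str(grade_threshold),
        "dateCutoff": date_cutoff,
        "includeActive": "true" if include_active else "false",
        "tiers": ",".join(tiers),
        "batchId": batch_id,
    }
    tracking_row = {
        "batch_id": batch_id,
        "status": "TRIGGERED",
        "requested_at": datetime.now(timezone.utc).isoformat(),
        "parameters": template_params,
    }
    return template_params, tracking_row
```

Keeping this shaping logic separate from the Flask route makes it easy to unit-test the validation without standing up the web app.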

Results

The pipeline runs quarterly with zero engineering involvement. The client configures parameters through the dashboard, triggers the run, and receives graded export files in Cloud Storage. What used to consume a team member for two full weeks is now a 15-minute task followed by an automated 6-hour pipeline execution.

Tech Stack

| Layer     | Technology             | Why                                 |
| --------- | ---------------------- | ----------------------------------- |
| Pipeline  | Apache Beam            | Portable, scalable batch processing |
| Runtime   | Google Cloud Dataflow  | Managed, serverless, auto-scaling   |
| Storage   | BigQuery + Cloud Storage | Audit trail + export files        |
| Secrets   | Secret Manager         | API keys and credentials            |
| Dashboard | Flask (Python)         | Lightweight self-service UI         |
| APIs      | REST + RapidAPI        | Public listing data sources         |

Download the full case study (PDF).

GTA Labs — AI consulting that ships.