Automated Data Pipeline — Proptech Industry
Replaced a 2-week manual Excel process with a serverless Apache Beam pipeline on Google Cloud Dataflow. Runs quarterly in 6 hours, unattended.
The Problem
A company in the proptech space needed to aggregate and grade professional records from public listing APIs across the United States. Each cycle involved pulling millions of records, scoring individuals against configurable thresholds, and exporting enriched data for their sales team.
The existing process was entirely manual — a team member spent two weeks per cycle working through spreadsheets, paginating APIs by hand, and calculating grades in Excel. The process ran quarterly, and each cycle was error-prone and exhausting.
What We Built
Phase 1 — Prototype
We started with a custom Python application that dynamically paginated through the API, aggregating records per individual until configurable thresholds were met (e.g., $40M total volume for the top tier). This prototype validated the data model and API integration patterns, handling rate limiting, exponential backoff retries, and malformed response recovery.
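The pagination strategy can be sketched as follows. This is a minimal illustration, not the client's actual code: `fetch_with_backoff`, `aggregate_until_threshold`, and the page-depth limit are hypothetical names, and the $40M figure is the top-tier threshold mentioned above.

```python
import time

TIER_THRESHOLD = 40_000_000  # e.g. $40M total volume for the top tier
MAX_RETRIES = 3

def fetch_with_backoff(fetch_page, page, retries=MAX_RETRIES, base_delay=1.0):
    """Retry transient API failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch_page(page)
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def aggregate_until_threshold(fetch_page, threshold=TIER_THRESHOLD, max_pages=19):
    """Paginate per individual, stopping once the tier threshold is met."""
    total, records = 0, []
    for page in range(max_pages):
        batch = fetch_with_backoff(fetch_page, page)
        if not batch:            # empty page: no more records for this person
            break
        records.extend(batch)
        total += sum(r["volume"] for r in batch)
        if total >= threshold:   # configurable cutoff met: stop early
            break
    return total, records
```

Stopping at the threshold rather than exhausting every page is what kept API usage (and runtime) proportional to the grading criteria instead of the full record history.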
Phase 2 — Production Pipeline
The production system was rebuilt on Apache Beam, deployed as a Google Cloud Dataflow template. This gave the client a managed, serverless pipeline that auto-scales workers based on data volume — no infrastructure to maintain.
Pipeline Architecture: 3-stage fan-out design. Stage 1 fetches profile data and calculates initial scores. Stages 2 and 3 paginate through active and historical records respectively, with pagination loops unrolled up to 19 pages deep and circuit breakers that trip after consecutive failures. All stages write raw API responses to BigQuery for audit and reprocessing.
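The consecutive-failure circuit breaker in the pagination stages behaves roughly like this. A pure-Python sketch with illustrative names and an assumed trip threshold of 3; in production this logic lives inside a Beam transform rather than a plain loop.

```python
MAX_PAGES = 19                 # pagination unrolled to this depth
MAX_CONSECUTIVE_FAILURES = 3   # assumed trip threshold (illustrative)

def paginate_with_breaker(fetch, max_pages=MAX_PAGES,
                          max_failures=MAX_CONSECUTIVE_FAILURES):
    """Walk pages in order; abort the stage after too many consecutive failures."""
    results, failures = [], 0
    for page in range(max_pages):
        try:
            batch = fetch(page)
        except Exception:
            failures += 1
            if failures >= max_failures:  # circuit trips: give up on this stage
                break
            continue
        failures = 0                      # any success resets the counter
        if not batch:
            break                         # no more records
        results.extend(batch)
    return results
```

The breaker distinguishes a flaky API (isolated failures, counter resets on the next success) from a broken one (consecutive failures, stage aborts), so one bad upstream endpoint can't stall an entire quarterly run.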
Phase 3 — Self-Service Dashboard
A Flask-based web dashboard lets the client's team trigger pipeline runs with configurable parameters: grade thresholds, date cutoffs, whether to include active records, and which tiers to process. The dashboard writes batch tracking records to BigQuery and monitors pipeline status — the team runs it themselves without engineering support.
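A trigger endpoint along these lines captures the shape of the dashboard's run configuration. This is a hedged sketch: the route, parameter names, and `launch_pipeline` stub are hypothetical, standing in for the real Dataflow template launch and BigQuery batch-tracking write.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def launch_pipeline(params):
    """Stub for the real launcher, which calls the Dataflow templates
    launch API and writes a batch-tracking row to BigQuery."""
    return {"job_id": "dataflow-job-123", "params": params}

@app.route("/runs", methods=["POST"])
def trigger_run():
    body = request.get_json(force=True)
    # Configurable run parameters, mirroring the dashboard's options.
    params = {
        "grade_threshold": body.get("grade_threshold", 40_000_000),
        "date_cutoff": body.get("date_cutoff"),
        "include_active": bool(body.get("include_active", True)),
        "tiers": body.get("tiers", ["top"]),
    }
    if params["date_cutoff"] is None:
        return jsonify({"error": "date_cutoff is required"}), 400
    job = launch_pipeline(params)
    return jsonify(job), 202
```

Validating parameters server-side before launching means a misconfigured run fails in seconds at the dashboard rather than hours into a Dataflow job.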
Results
The pipeline runs quarterly with zero engineering involvement. The client configures parameters through the dashboard, triggers the run, and receives graded export files in Cloud Storage. What used to consume a team member for two full weeks is now a 15-minute task followed by an automated 6-hour pipeline execution.
Tech Stack
| Layer | Technology | Why |
|---|---|---|
| Pipeline | Apache Beam | Portable, scalable batch processing |
| Runtime | Google Cloud Dataflow | Managed, serverless, auto-scaling |
| Storage | BigQuery + Cloud Storage | Audit trail + export files |
| Secrets | Secret Manager | API keys and credentials |
| Dashboard | Flask (Python) | Lightweight self-service UI |
| APIs | REST + RapidAPI | Public listing data sources |
