1. The Mapping Bottleneck
Every OMOP ETL has a mapping step: take the source system's local codes and translate them to standard OMOP concept IDs. This step has two costs that people conflate but shouldn't:

Cost 1: The one-time setup. Figuring out what each local code maps to. "Cr_Serum" → LOINC 2160-0. "WBC_Count" → LOINC 6690-2. "Local_Sepsis_Dx" → SNOMED 91302008. This is the hard intellectual work - searching vocabularies, validating matches, handling ambiguity.

Cost 2: The runtime application. Applying those mappings to millions of records every night. This is a SQL JOIN or pandas merge - fast, mechanical, and local.

The mistake is conflating these two costs and loading a full Athena vocabulary database to handle both. You don't need 4GB of vocabulary tables to map 500 unique local codes. But you also can't make HTTP API calls for every row in a million-record batch.

OMOPHub is the right tool for Cost 1: the mapping discovery and validation phase. Search for your local codes, verify the matches, build a mapping cache. Then use that cache - locally, with no API calls - for Cost 2: the nightly production run.

This is "Lean ETL": use OMOPHub to build smart, validated mapping files during development, then apply those mappings via local lookups during production. No full vocabulary load needed for the ETL. No API latency during batch processing. Best of both worlds.

2. The Core Concept: The Mapping Cache Pattern
The workflow has three phases:

Phase 1: Extract unique source codes. Your source data has millions of records but a much smaller set of unique codes. A hospital lab system might produce 500,000 lab results per month but use only 800 unique local lab codes. Extract those 800 codes - that's your mapping workload.

Phase 2: Look up each unique code via OMOPHub. Send each of the 800 codes to OMOPHub (search by display name, code, or fuzzy match). Get back the standard OMOP concept ID, name, vocabulary, and domain. This takes a few minutes for 800 codes. Save the results as a mapping file (CSV, JSON, or database table).

Phase 3: Apply the mapping cache in production. Your nightly ETL reads the mapping file, joins it to the source data via local lookup (pandas merge, SQL JOIN, dictionary lookup), and writes the enriched records to OMOP CDM tables. Zero API calls. Full local speed.

When to re-run Phase 2: when new source codes appear that aren't in your mapping cache. Your ETL detects unmapped codes, queues them for OMOPHub lookup, and a human (or automated process) reviews and approves the new mappings before they enter the cache.

3. Use Case A: Building a Mapping Cache for a Sepsis Study
A multi-site sepsis research project receives data from four hospitals. Each hospital uses different local codes for lactate, WBC, blood cultures, and sepsis diagnoses. The ETL needs a consistent mapping from all local codes to standard OMOP concepts.
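A minimal sketch of Phases 1 and 2, using pandas and the standard library. The OMOPHub endpoint URL, query parameters, and response shape shown here are assumptions for illustration - substitute the real OMOPHub API or client library, and adjust the column names to your source schema.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

# Hypothetical OMOPHub search endpoint - replace with the real one.
OMOPHUB_SEARCH_URL = "https://api.omophub.example/v1/concepts/search"


def extract_unique_codes(source, code_col="local_code", name_col="local_name"):
    """Phase 1: reduce millions of rows to the small set of unique local codes."""
    return (
        source[[code_col, name_col]]
        .drop_duplicates(subset=code_col)
        .reset_index(drop=True)
    )


def lookup_concept(local_name, api_key):
    """Phase 2: one OMOPHub search per unique code (hypothetical endpoint).

    Returns the top standard-concept hit as a dict, or None if no match.
    """
    query = urllib.parse.urlencode(
        {"q": local_name, "standard_only": "true", "limit": 1}
    )
    req = urllib.request.Request(
        f"{OMOPHUB_SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        hits = json.load(resp).get("results", [])
    return hits[0] if hits else None


def build_mapping_cache(source, api_key, out_path="mapping_cache.csv"):
    """Look up every unique code once and persist the results as a CSV cache."""
    unique = extract_unique_codes(source)
    rows = []
    for _, r in unique.iterrows():
        hit = lookup_concept(r["local_name"], api_key) or {}
        rows.append(
            {
                "local_code": r["local_code"],
                "local_name": r["local_name"],
                "concept_id": hit.get("concept_id"),
                "concept_name": hit.get("concept_name"),
                "vocabulary_id": hit.get("vocabulary_id"),
                "domain_id": hit.get("domain_id"),
            }
        )
    cache = pd.DataFrame(rows)
    cache.to_csv(out_path, index=False)
    return cache
```

Rows where `concept_id` comes back empty are the ambiguous cases - those are exactly the ones a human should review before the cache goes into production.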
4. Use Case B: Applying the Mapping Cache in Production
Once you have the mapping cache, the nightly ETL is pure local processing - no API calls.
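A sketch of the Phase 3 join, assuming the CSV cache built in Use Case A. The column names (`local_code`, `concept_id`, and so on) are illustrative; the pattern is just a left join followed by a split on whether the join found a match.

```python
import pandas as pd


def apply_mapping_cache(source, cache):
    """Left-join source records to the mapping cache; split mapped vs unmapped."""
    enriched = source.merge(
        cache[["local_code", "concept_id", "concept_name", "domain_id"]],
        on="local_code",
        how="left",
    )
    # Rows with no cache hit have a null concept_id after the left join.
    unmapped_records = enriched[enriched["concept_id"].isna()]
    mapped_records = enriched[enriched["concept_id"].notna()]
    return mapped_records, unmapped_records


def nightly_etl(source_path, cache_path="mapping_cache.csv"):
    """Nightly run: read source + cache, join locally, return the two partitions."""
    source = pd.read_csv(source_path)
    cache = pd.read_csv(cache_path)
    return apply_mapping_cache(source, cache)
```

`mapped_records` flows on to the OMOP CDM tables (measurement, condition_occurrence, etc.); `unmapped_records` goes to the review queue described in the next section.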
5. When New Codes Appear
The mapping cache isn't static. As hospitals add new tests or change local codes, unmapped codes appear. The workflow:

- ETL detects unmapped codes (the `unmapped_records` in Use Case B)
- Queue the new codes for OMOPHub lookup
- Run the Phase 2 script (Use Case A) on just the new codes
- Human reviews the suggested mappings (critical for data quality)
- Append approved mappings to the cache
- Re-run the ETL for the previously unmapped records
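The detection and append steps of that loop can be sketched as two small helpers. The file layout and the `approved` column are assumptions - adapt them to whatever review tooling (spreadsheet, ticket queue, curation UI) your team uses.

```python
import pandas as pd


def find_new_codes(source, cache):
    """Unique source codes that are not yet in the mapping cache."""
    known = set(cache["local_code"])
    new = source[~source["local_code"].isin(known)]
    return new[["local_code", "local_name"]].drop_duplicates(subset="local_code")


def append_approved(cache, reviewed):
    """Append only human-approved rows (approved == True) to the cache."""
    approved = reviewed[reviewed["approved"]]
    return pd.concat(
        [cache, approved.drop(columns=["approved"])], ignore_index=True
    )
```

`find_new_codes` output feeds the Phase 2 script from Use Case A; once a reviewer marks the suggestions approved, `append_approved` grows the cache and the previously unmapped records can be re-run.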
6. Conclusion: OMOPHub for Discovery, Local Cache for Production
The "Compute Tax" isn't loading vocabulary tables - it's doing vocabulary lookups at runtime instead of build time. The lean approach:

- Build time: Use OMOPHub to discover and validate mappings for your unique source codes. Takes minutes. Run it once per new site or when new codes appear.
- Runtime: Apply the mapping cache via local pandas merge or SQL JOIN. Takes milliseconds per million records. No API calls. No network dependency.