
1. The “Manual Mapping” Tax

If you’ve ever spent an entire afternoon staring at a spreadsheet, trying to figure out which LOINC code matches your hospital’s local lab entry for “GLUC_FAST_SER” - you know the pain. It’s a hidden tax on your time. And it compounds. Traditional ETL processes for converting raw healthcare data into the OMOP Common Data Model (CDM) are slow, expensive, and riddled with human error. Every local code, every proprietary label, every physician shorthand needs to be manually mapped to a standardized vocabulary. Multiply that across thousands of concepts, and you’ve got weeks of work that doesn’t even feel like real work.

But here’s the thing: most of that mapping is repetitive and predictable. And predictable work is exactly the kind of work that should be automated.

That’s where OMOPHub comes in. It’s a vocabulary API that gives you instant, programmatic access to the OHDSI ATHENA vocabularies: SNOMED, ICD-10, LOINC, RxNorm, and 100+ others - without needing to set up and maintain a local PostgreSQL database. Combine it with NLP tools or LLMs for the text extraction step, and you’ve got a powerful end-to-end clinical coding pipeline.

The goal? Shift from being a data cleaner to a clinical researcher. Let the tools handle the 80%, so you can focus your expertise on the 20% that actually matters.

2. The Core Concept: From Raw Text to Standardized Concepts

At its heart, clinical coding automation is about bridging the gap between diverse, messy source data and the structured world of the OMOP CDM. Think of it as a two-stage translation process:
  1. Extract: Use NLP tools (like MedCAT, cTAKES, Amazon Comprehend Medical, or an LLM) to identify clinical entities from unstructured or semi-structured text - conditions, medications, procedures, lab tests.
  2. Map & Validate: Use OMOPHub’s vocabulary API to translate those extracted entities into standardized OMOP concept IDs (SNOMED, ICD-10, LOINC, etc.), and review the results with confidence metrics for human-in-the-loop quality assurance.
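The two-stage shape can be sketched in plain Python before any specific tools enter the picture. Everything here is illustrative, not OMOPHub's API: the `lookup` callable stands in for whatever vocabulary search you wire in at stage two.

```python
from dataclasses import dataclass

# Stage 1 output: what an NLP extractor hands to the mapping stage.
@dataclass
class ExtractedEntity:
    text: str      # surface form found in the note, e.g. "hypertension"
    category: str  # condition, drug, measurement, ...

# Stage 2 output: the standardized OMOP side of the bridge.
@dataclass
class MappedConcept:
    source_text: str
    concept_id: int
    concept_name: str
    vocabulary_id: str

def map_entities(entities, lookup):
    """Run each extracted entity through a vocabulary lookup.

    `lookup` is any callable returning (concept_id, concept_name,
    vocabulary_id) or None; in practice it would wrap a vocabulary
    API search. Unresolved entities go to a human-review queue.
    """
    mapped, unmapped = [], []
    for ent in entities:
        hit = lookup(ent.text)
        if hit is None:
            unmapped.append(ent)  # queue for human review
        else:
            cid, cname, vocab = hit
            mapped.append(MappedConcept(ent.text, cid, cname, vocab))
    return mapped, unmapped
```

The split into `mapped` and `unmapped` is the whole design: automation handles what it can, and everything else is explicitly surfaced rather than silently dropped.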
Why OMOPHub for the mapping step? Traditionally, querying OMOP vocabularies meant downloading multi-gigabyte ATHENA CSV files and loading them into your own database. OMOPHub eliminates that overhead. Install the Python SDK, add your API key, and start searching concepts, traversing hierarchies, and building mappings in minutes. It covers all major OHDSI vocabularies with bi-annual updates, batch operations, and healthcare-grade security. It’s not an NLP engine. It’s the vocabulary backbone that makes your NLP pipeline clinically accurate.

3. Use Case A: Mapping Entities from Physician Notes (Unstructured Data)

Imagine you’re a clinical researcher tasked with analyzing thousands of physician notes to identify patients with specific conditions. Manually reading those notes is impractical. But even after you run NLP extraction, you still need to link extracted terms to standardized OMOP concepts.

The Scenario: A researcher uses an NLP tool to extract clinical entities from free-text physician notes, then uses OMOPHub to map those entities to standard SNOMED or ICD-10 concepts.

The Two-Step Workflow:
  • Step 1: An NLP tool (e.g., MedCAT, an LLM, or Amazon Comprehend Medical) processes the note and extracts entities like “acute myocardial infarction,” “Type 2 Diabetes Mellitus,” and “hypertension.”
  • Step 2: OMOPHub’s search API maps each extracted entity to the correct OMOP concept ID.
Code Snippet: Looking Up Extracted Entities via OMOPHub

First, install the OMOPHub Python SDK:
pip install omophub
Now, let’s see how to use it:
import omophub

# Initialize the OMOPHub client with your API key
# Set OMOPHUB_API_KEY as an environment variable
client = omophub.OMOPHub()

physician_note = """
Patient presented with severe chest pain radiating to the left arm,
shortness of breath, and diaphoresis. Initial ECG showed ST elevation
in leads V2-V4. Suspected acute myocardial infarction.
Past medical history includes Type 2 Diabetes Mellitus and hypertension.
"""

# Step 1: Extract entities using your NLP tool of choice.
# This is a simplified placeholder - in production you'd use
# MedCAT, cTAKES, Amazon Comprehend Medical, SciSpacy, or an LLM.
extracted_conditions = [
    "acute myocardial infarction",
    "Type 2 Diabetes Mellitus",
    "hypertension",
    "chest pain",
    "shortness of breath",
    "diaphoresis",
]

# Step 2: Map each extracted entity to OMOP concepts via OMOPHub
print("\nMapping extracted conditions to OMOP concepts:")

for condition in extracted_conditions:
    try:
        results = client.search.basic(
            query=condition,
            vocabulary_ids=["SNOMED", "ICD10CM"],
            standard_concept="S",  # Only standard concepts
            page_size=3,
        )

        if results and results.get("data"):
            top_match = results["data"][0]
            concept_name = top_match.get("concept_name", "N/A")
            concept_id = top_match.get("concept_id", "N/A")
            vocabulary_id = top_match.get("vocabulary_id", "N/A")
            print(
                f"  '{condition}' -> {concept_name} "
                f"(ID: {concept_id}, Vocab: {vocabulary_id})"
            )
        else:
            print(f"  '{condition}' -> No standard concept found")

    except omophub.APIStatusError as e:
        print(f"  API Error for '{condition}': {e.status_code}")
    except Exception as e:
        print(f"  Unexpected error for '{condition}': {e}")
The Key Insight: For large-scale research, “perfect” manual extraction is an illusion. The sheer volume of data makes it impractical and introduces its own inconsistencies. An automated NLP + vocabulary lookup pipeline provides a consistent, scalable baseline. You get 80% of the way there with a fraction of the effort, and then apply human expertise to the edge cases. That’s the leverage.

4. Use Case B: Scaling Lab Code Mapping (Semi-Structured Data)

Beyond unstructured notes, healthcare systems are full of local, proprietary codes for lab tests, medications, and procedures. Mapping these to standard vocabularies like LOINC or RxNorm is foundational for data interoperability.

The Scenario: A data engineer needs to map local lab test names to their corresponding LOINC concept IDs.

The Logic: For semi-structured data where you already have specific local names (not free text), OMOPHub’s search and mapping APIs are ideal. Search for each local code by name, find the best standard concept match, then retrieve its cross-vocabulary mappings if needed.

Code Snippet: Mapping Local Lab Codes to LOINC
import omophub

client = omophub.OMOPHub()

local_lab_codes = [
    "Fasting glucose",
    "Hemoglobin A1c",
    "Serum creatinine",
    "TSH level",
    "Urine sodium",
    "NON_EXISTENT_LAB_TEST",  # Example of something that might not map
]

print("\nMapping local lab codes to LOINC:")

for lab_code in local_lab_codes:
    try:
        # Search for the lab concept in LOINC vocabulary
        results = client.search.basic(
            query=lab_code,
            vocabulary_ids=["LOINC"],
            standard_concept="S",
            page_size=1,
        )

        if results and results.get("data") and len(results["data"]) > 0:
            match = results["data"][0]
            concept_name = match.get("concept_name", "N/A")
            concept_id = match.get("concept_id", "N/A")
            concept_code = match.get("concept_code", "N/A")
            print(
                f"  Local: '{lab_code}' -> LOINC: {concept_name} "
                f"(Concept ID: {concept_id}, LOINC Code: {concept_code})"
            )

            # Optionally: get mappings to other vocabularies
            # mappings = client.mappings.get(concept_id, target_vocabulary="SNOMED")
        else:
            print(f"  Local: '{lab_code}' -> No LOINC match found (needs manual review)")

    except omophub.APIStatusError as e:
        print(f"  API Error for '{lab_code}': {e.status_code}")
    except Exception as e:
        print(f"  Unexpected error for '{lab_code}': {e}")
The Key Insight: This is the “set it and automate it” mindset. By scripting the lookup of common local codes, data engineers can build ETL pipelines that handle the vast majority of transformations automatically. The unmapped codes get flagged for manual review. Over time, as you resolve edge cases, you build a growing lookup table that makes every subsequent ETL run faster. Less technical debt, faster time-to-analysis.
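That growing lookup table can start as nothing fancier than a JSON file consulted before any API call. Below is a minimal sketch, assuming an `api_lookup` callable that wraps your vocabulary search and returns a dict (or None when nothing matches); the file path and key normalization are illustrative choices.

```python
import json
from pathlib import Path

def cached_lookup(term, cache_path, api_lookup):
    """Check a local mapping table first; fall back to the API.

    `api_lookup` is any callable returning a concept dict or None.
    New resolutions are written back to the JSON file, so every
    ETL run benefits from the ones before it.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    key = term.strip().lower()  # normalize so "TSH" and "tsh " collide
    if key in cache:
        return cache[key]  # resolved on a previous run, no API call
    result = api_lookup(term)
    if result is not None:
        cache[key] = result
        path.write_text(json.dumps(cache, indent=2))
    return result
```

Manually corrected mappings from the review step can be written into the same file, so reviewer effort is never spent twice on the same local code.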

5. Validation & Human-in-the-Loop

Automation is powerful, but it’s not infallible. Clinical data is complex, and no automated system achieves 100% accuracy without validation. This is where the “human-in-the-loop” step earns its keep. Here’s how to build a practical validation workflow:
  • Flag uncertain mappings: If a search returns multiple plausible matches or the top result doesn’t look right, flag it for manual review by a clinical expert. You can use heuristics like comparing the returned concept_name against the original term, or checking whether the concept’s domain_id matches your expectation.
  • Prioritize review: Instead of reviewing every mapping, clinical experts focus on the ambiguous or high-impact cases - the codes that affect cohort definitions or safety endpoints.
  • Iterate and improve: Feedback from reviewers feeds back into your mapping logic. Resolved edge cases become lookup rules. The system gets smarter with each ETL cycle.
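The concept_name comparison heuristic from the first bullet can be scripted with nothing but the standard library. A sketch, with the caveat that the 0.6 threshold is an illustrative starting point to tune against your own data, not a clinical standard:

```python
from difflib import SequenceMatcher

def needs_review(source_term, concept_name,
                 expected_domain=None, domain_id=None, threshold=0.6):
    """Flag a proposed mapping for human review.

    Returns True when the string similarity between the local term
    and the matched concept name is low, or when the concept landed
    in an unexpected domain (e.g. a lab name matching a Condition).
    """
    similarity = SequenceMatcher(
        None, source_term.lower(), concept_name.lower()
    ).ratio()
    if similarity < threshold:
        return True
    if expected_domain and domain_id and domain_id != expected_domain:
        return True
    return False
```

String similarity is crude - it will flag legitimate synonym matches - but that is acceptable here: the cost of a false flag is one extra row in the review queue, while the cost of a silent mismatch is a corrupted cohort.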
The Practical Workflow: Export a CSV containing the original local code, the proposed OMOP concept, its concept ID, vocabulary, and columns for “Reviewed (Y/N)” and “Corrected Concept ID.” This gives clinical experts a clean, sortable review surface. Simple, effective, scalable.
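That export is a few lines with the standard library. A minimal sketch; the row keys assume the dict shape produced by the search snippets earlier, and the column names follow the workflow described above:

```python
import csv

def write_review_csv(rows, path):
    """Write proposed mappings to a reviewer-friendly CSV.

    Each row is a dict holding the local code and its proposed OMOP
    concept; the last two columns are left blank for the reviewer.
    """
    fieldnames = [
        "local_code", "proposed_concept_name", "concept_id",
        "vocabulary_id", "reviewed", "corrected_concept_id",
    ]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({
                "local_code": row["local_code"],
                "proposed_concept_name": row["concept_name"],
                "concept_id": row["concept_id"],
                "vocabulary_id": row["vocabulary_id"],
                "reviewed": "",              # Y/N, filled in by reviewer
                "corrected_concept_id": "",  # only if the proposal was wrong
            })
```

Reviewers can sort by vocabulary or filter to unreviewed rows in any spreadsheet tool, and the completed file can be read back with `csv.DictReader` to feed corrections into the next ETL run.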

6. Conclusion: Reclaiming Your Time

Clinical coding automation isn’t about replacing clinical expertise - it’s about deploying it where it matters most. The combination of NLP tools for entity extraction and OMOPHub for vocabulary lookup and mapping gives you a pipeline that handles the repetitive 80% while surfacing the 20% that needs your judgment.

By integrating OMOPHub into your ETL workflow, you go from maintaining local vocabulary databases and doing manual lookups to making simple API calls. That’s less infrastructure, faster iterations, and more time spent on actual research.

Try the Python snippets with your own data. Start with a small batch of local codes. See how many map cleanly on the first pass. I think you’ll be surprised at how much time you get back. The “manual mapping tax”? Consider it optional.