
1. The “Manual Mapping” Tax

If you’ve ever spent an entire afternoon staring at a spreadsheet, trying to figure out which LOINC code matches your hospital’s local lab entry for “GLUC_FAST_SER” - you know the pain. It’s a hidden tax on your time. And it compounds. Traditional ETL processes for converting raw healthcare data into the OMOP Common Data Model (CDM) are slow, expensive, and riddled with human error. Every local code, every proprietary label, every physician shorthand needs to be manually mapped to a standardized vocabulary. Multiply that across thousands of concepts, and you’ve got weeks of work that doesn’t even feel like real work.

But here’s the thing: most of that mapping is repetitive and predictable. And predictable work is exactly the kind of work that should be automated.

That’s where OMOPHub comes in. It’s a vocabulary API that gives you instant, programmatic access to the OHDSI ATHENA vocabularies: SNOMED, ICD-10, LOINC, RxNorm, and 100+ others - without needing to set up and maintain a local PostgreSQL database. Combine it with NLP tools or LLMs for the text extraction step, and you’ve got a powerful end-to-end clinical coding pipeline.

The goal? Shift from being a data cleaner to a clinical researcher. Let the tools handle the 80%, so you can focus your expertise on the 20% that actually matters.

2. The Core Concept: From Raw Text to Standardized Concepts

At its heart, clinical coding automation is about bridging the gap between diverse, messy source data and the structured world of the OMOP CDM. Think of it as a two-stage translation process:
  1. Extract: Use NLP tools (like MedCAT, cTAKES, Amazon Comprehend Medical, or an LLM) to identify clinical entities from unstructured or semi-structured text - conditions, medications, procedures, lab tests.
  2. Map & Validate: Use OMOPHub’s vocabulary API to translate those extracted entities into standardized OMOP concept IDs (SNOMED, ICD-10, LOINC, etc.), and review the results with confidence metrics for human-in-the-loop quality assurance.
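The two-stage shape can be sketched in plain Python before any specific tools enter the picture. Everything here is illustrative, not OMOPHub's API: the `lookup` callable stands in for whatever vocabulary search you wire in at stage two.

```python
from dataclasses import dataclass

# Stage 1 output: what an NLP extractor hands to the mapping stage.
@dataclass
class ExtractedEntity:
    text: str      # surface form found in the note, e.g. "hypertension"
    category: str  # condition, drug, measurement, ...

# Stage 2 output: the standardized OMOP side of the bridge.
@dataclass
class MappedConcept:
    source_text: str
    concept_id: int
    concept_name: str
    vocabulary_id: str

def map_entities(entities, lookup):
    """Run each extracted entity through a vocabulary lookup.

    `lookup` is any callable returning (concept_id, concept_name,
    vocabulary_id) or None; in practice it would wrap a vocabulary
    API search. Unresolved entities go to a human-review queue.
    """
    mapped, unmapped = [], []
    for ent in entities:
        hit = lookup(ent.text)
        if hit is None:
            unmapped.append(ent)  # queue for human review
        else:
            cid, cname, vocab = hit
            mapped.append(MappedConcept(ent.text, cid, cname, vocab))
    return mapped, unmapped
```

The split into `mapped` and `unmapped` is the whole design: automation handles what it can, and everything else is explicitly surfaced rather than silently dropped.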
Why OMOPHub for the mapping step? Traditionally, querying OMOP vocabularies meant downloading multi-gigabyte ATHENA CSV files and loading them into your own database. OMOPHub eliminates that overhead. Install the Python SDK, add your API key, and start searching concepts, traversing hierarchies, and building mappings in minutes. It covers all major OHDSI vocabularies with bi-annual updates, batch operations, and healthcare-grade security. It’s not an NLP engine. It’s the vocabulary backbone that makes your NLP pipeline clinically accurate.

3. Use Case A: Mapping Entities from Physician Notes (Unstructured Data)

Imagine you’re a clinical researcher tasked with analyzing thousands of physician notes to identify patients with specific conditions. Manually reading those notes is impractical. But even after you run NLP extraction, you still need to link extracted terms to standardized OMOP concepts.

The Scenario: A researcher uses an NLP tool to extract clinical entities from free-text physician notes, then uses OMOPHub to map those entities to standard SNOMED or ICD-10 concepts.

The Two-Step Workflow:
  • Step 1: An NLP tool (e.g., MedCAT, an LLM, or Amazon Comprehend Medical) processes the note and extracts entities like “acute myocardial infarction,” “Type 2 Diabetes Mellitus,” and “hypertension.”
  • Step 2: OMOPHub’s search API maps each extracted entity to the correct OMOP concept ID.
Code Snippet: Looking Up Extracted Entities via OMOPHub

First, install the OMOPHub Python SDK:
pip install omophub
Now, let’s see how to use it:
import omophub

# Initialize the OMOPHub client with your API key
# Set OMOPHUB_API_KEY as an environment variable
client = omophub.OMOPHub()

physician_note = """
Patient presented with severe chest pain radiating to the left arm,
shortness of breath, and diaphoresis. Initial ECG showed ST elevation
in leads V2-V4. Suspected acute myocardial infarction.
Past medical history includes Type 2 Diabetes Mellitus and hypertension.
"""

# Step 1: Extract entities using your NLP tool of choice.
# This is a simplified placeholder - in production you'd use
# MedCAT, cTAKES, Amazon Comprehend Medical, SciSpacy, or an LLM.
extracted_conditions = [
    "acute myocardial infarction",
    "Type 2 Diabetes Mellitus",
    "hypertension",
    "chest pain",
    "shortness of breath",
    "diaphoresis",
]

# Step 2: Map each extracted entity to OMOP concepts via OMOPHub
print("\nMapping extracted conditions to OMOP concepts:")

for condition in extracted_conditions:
    try:
        results = client.search.basic(
            query=condition,
            vocabulary_ids=["SNOMED", "ICD10CM"],
            standard_concept="S",  # Only standard concepts
            page_size=3,
        )

        if results and results.get("data"):
            top_match = results["data"][0]
            concept_name = top_match.get("concept_name", "N/A")
            concept_id = top_match.get("concept_id", "N/A")
            vocabulary_id = top_match.get("vocabulary_id", "N/A")
            print(
                f"  '{condition}' -> {concept_name} "
                f"(ID: {concept_id}, Vocab: {vocabulary_id})"
            )
        else:
            print(f"  '{condition}' -> No standard concept found")

    except omophub.APIStatusError as e:
        print(f"  API Error for '{condition}': {e.status_code}")
    except Exception as e:
        print(f"  Unexpected error for '{condition}': {e}")
The Key Insight: For large-scale research, “perfect” manual extraction is an illusion. The sheer volume of data makes it impractical and introduces its own inconsistencies. An automated NLP + vocabulary lookup pipeline provides a consistent, scalable baseline. You get 80% of the way there with a fraction of the effort, and then apply human expertise to the edge cases. That’s the leverage.

4. Use Case B: Scaling Lab Code Mapping (Semi-Structured Data)

Beyond unstructured notes, healthcare systems are full of local, proprietary codes for lab tests, medications, and procedures. Mapping these to standard vocabularies like LOINC or RxNorm is foundational for data interoperability.

The Scenario: A data engineer needs to map local lab test names to their corresponding LOINC concept IDs.

The Logic: For semi-structured data where you already have specific local names (not free text), OMOPHub’s search and mapping APIs are ideal. Search for each local code by name, find the best standard concept match, then retrieve its cross-vocabulary mappings if needed.

Code Snippet: Mapping Local Lab Codes to LOINC
import omophub

client = omophub.OMOPHub()

local_lab_codes = [
    "Fasting glucose",
    "Hemoglobin A1c",
    "Serum creatinine",
    "TSH level",
    "Urine sodium",
    "NON_EXISTENT_LAB_TEST",  # Example of something that might not map
]

print("\nMapping local lab codes to LOINC:")

for lab_code in local_lab_codes:
    try:
        # Search for the lab concept in LOINC vocabulary
        results = client.search.basic(
            query=lab_code,
            vocabulary_ids=["LOINC"],
            standard_concept="S",
            page_size=1,
        )

        if results and results.get("data") and len(results["data"]) > 0:
            match = results["data"][0]
            concept_name = match.get("concept_name", "N/A")
            concept_id = match.get("concept_id", "N/A")
            concept_code = match.get("concept_code", "N/A")
            print(
                f"  Local: '{lab_code}' -> LOINC: {concept_name} "
                f"(Concept ID: {concept_id}, LOINC Code: {concept_code})"
            )

            # Optionally: get mappings to other vocabularies
            # mappings = client.mappings.get(concept_id, target_vocabulary="SNOMED")
        else:
            print(f"  Local: '{lab_code}' -> No LOINC match found (needs manual review)")

    except omophub.APIStatusError as e:
        print(f"  API Error for '{lab_code}': {e.status_code}")
    except Exception as e:
        print(f"  Unexpected error for '{lab_code}': {e}")
The Key Insight: This is the “set it and automate it” mindset. By scripting the lookup of common local codes, data engineers can build ETL pipelines that handle the vast majority of transformations automatically. The unmapped codes get flagged for manual review. Over time, as you resolve edge cases, you build a growing lookup table that makes every subsequent ETL run faster. Less technical debt, faster time-to-analysis.
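That growing lookup table can start as nothing fancier than a JSON file consulted before any API call. Below is a minimal sketch, assuming an `api_lookup` callable that wraps your vocabulary search and returns a dict (or None when nothing matches); the file path and key normalization are illustrative choices.

```python
import json
from pathlib import Path

def cached_lookup(term, cache_path, api_lookup):
    """Check a local mapping table first; fall back to the API.

    `api_lookup` is any callable returning a concept dict or None.
    New resolutions are written back to the JSON file, so every
    ETL run benefits from the ones before it.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    key = term.strip().lower()  # normalize so "TSH" and "tsh " collide
    if key in cache:
        return cache[key]  # resolved on a previous run, no API call
    result = api_lookup(term)
    if result is not None:
        cache[key] = result
        path.write_text(json.dumps(cache, indent=2))
    return result
```

Manually corrected mappings from the review step can be written into the same file, so reviewer effort is never spent twice on the same local code.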

5. Validation & Human-in-the-Loop

Automation is powerful, but it’s not infallible. Clinical data is complex, and no automated system achieves 100% accuracy without validation. This is where the “human-in-the-loop” step earns its keep. Here’s how to build a practical validation workflow:
  • Flag uncertain mappings: If a search returns multiple plausible matches or the top result doesn’t look right, flag it for manual review by a clinical expert. You can use heuristics like comparing the returned concept_name against the original term, or checking whether the concept’s domain_id matches your expectation.
  • Prioritize review: Instead of reviewing every mapping, clinical experts focus on the ambiguous or high-impact cases - the codes that affect cohort definitions or safety endpoints.
  • Iterate and improve: Feedback from reviewers feeds back into your mapping logic. Resolved edge cases become lookup rules. The system gets smarter with each ETL cycle.
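The concept_name comparison heuristic from the first bullet can be scripted with nothing but the standard library. A sketch, with the caveat that the 0.6 threshold is an illustrative starting point to tune against your own data, not a clinical standard:

```python
from difflib import SequenceMatcher

def needs_review(source_term, concept_name,
                 expected_domain=None, domain_id=None, threshold=0.6):
    """Flag a proposed mapping for human review.

    Returns True when the string similarity between the local term
    and the matched concept name is low, or when the concept landed
    in an unexpected domain (e.g. a lab name matching a Condition).
    """
    similarity = SequenceMatcher(
        None, source_term.lower(), concept_name.lower()
    ).ratio()
    if similarity < threshold:
        return True
    if expected_domain and domain_id and domain_id != expected_domain:
        return True
    return False
```

String similarity is crude - it will flag legitimate synonym matches - but that is acceptable here: the cost of a false flag is one extra row in the review queue, while the cost of a silent mismatch is a corrupted cohort.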
The Practical Workflow: Export a CSV containing the original local code, the proposed OMOP concept, its concept ID, vocabulary, and columns for “Reviewed (Y/N)” and “Corrected Concept ID.” This gives clinical experts a clean, sortable review surface. Simple, effective, scalable.
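That export is a few lines with the standard library. A minimal sketch; the row keys assume the dict shape produced by the search snippets earlier, and the column names follow the workflow described above:

```python
import csv

def write_review_csv(rows, path):
    """Write proposed mappings to a reviewer-friendly CSV.

    Each row is a dict holding the local code and its proposed OMOP
    concept; the last two columns are left blank for the reviewer.
    """
    fieldnames = [
        "local_code", "proposed_concept_name", "concept_id",
        "vocabulary_id", "reviewed", "corrected_concept_id",
    ]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({
                "local_code": row["local_code"],
                "proposed_concept_name": row["concept_name"],
                "concept_id": row["concept_id"],
                "vocabulary_id": row["vocabulary_id"],
                "reviewed": "",              # Y/N, filled in by reviewer
                "corrected_concept_id": "",  # only if the proposal was wrong
            })
```

Reviewers can sort by vocabulary or filter to unreviewed rows in any spreadsheet tool, and the completed file can be read back with `csv.DictReader` to feed corrections into the next ETL run.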

6. Conclusion: Reclaiming Your Time

Clinical coding automation isn’t about replacing clinical expertise - it’s about deploying it where it matters most. The combination of NLP tools for entity extraction and OMOPHub for vocabulary lookup and mapping gives you a pipeline that handles the repetitive 80% while surfacing the 20% that needs your judgment.

By integrating OMOPHub into your ETL workflow, you go from maintaining local vocabulary databases and doing manual lookups to making simple API calls. That’s less infrastructure, faster iterations, and more time spent on actual research.

Try the Python snippets with your own data. Start with a small batch of local codes. See how many map cleanly on the first pass. I think you’ll be surprised at how much time you get back. The “manual mapping tax”? Consider it optional.