
1. The “Contribution Friction”

Every institution converting local EHR data to OMOP hits unmappable codes. Local lab abbreviations with no LOINC equivalent. Institutional procedure codes that don’t exist in SNOMED. Novel biomarkers that haven’t been added to any standard vocabulary yet. These unmapped codes are gaps - and they’re valuable.

Each one is a potential contribution to the OHDSI vocabulary ecosystem. If your hospital has a local code for a clinical concept that SNOMED doesn’t cover, and you identify it, the OHDSI Vocabulary Team can add it in the next Athena release. Every institution that converts to OMOP after that benefits from your discovery.

But finding these gaps is tedious. You have 3,000 local codes. Most will map fine. Maybe 50 are genuinely missing from the standard vocabularies. The other 2,950 are just lookup work. How do you find the 50 that matter?

OMOPHub makes the gap detection step fast. Search for each local code programmatically. The ones that return no match (or only weak matches) are your gap candidates. That’s your shortlist for human review and potential OHDSI contribution.

What OMOPHub does not do: submit contributions to OHDSI, add concepts to Athena, or determine whether a gap is “truly missing” vs. “just badly named.” Gap detection is automated; gap interpretation requires human clinical expertise. The OHDSI contribution itself goes through the community forums and vocabulary team - not through an API.

2. The Core Concept: Search Failures as Signals

The insight is simple: if OMOPHub’s search can’t find a match for a clinical term, that’s information. Tiered search strategy:
  1. Basic search with vocabulary and domain filters - catches exact and close matches
  2. Semantic search - catches misspellings, abbreviations, word-order variations
  3. If both fail - flag as a gap candidate
The tier matters because the type of failure tells you something:
  • Basic fails, semantic succeeds → probably a naming/abbreviation issue, not a real gap
  • Both fail → likely a genuine gap or very institution-specific term
  • Both return results, but wrong domain → mapping ambiguity, needs human review
This tiered approach reduces false positives (codes flagged as gaps that actually have mappings) and produces a cleaner shortlist for expert review.
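The tiering logic can be sketched as a pure function, independent of any particular search client. The result labels and the `domain_id` key here are illustrative (chosen to mirror OMOP concept fields), not OMOPHub API values:

```python
def classify_search_outcome(basic_hits, semantic_hits, expected_domain="Condition"):
    """Map the pattern of tier-1/tier-2 search results to a triage label.

    basic_hits / semantic_hits: lists of concept dicts (possibly empty),
    each with at least a "domain_id" key.
    """
    if basic_hits:
        domains = {hit.get("domain_id") for hit in basic_hits}
        if expected_domain in domains:
            return "mapped"            # basic search found it in the expected domain
        return "wrong_domain"          # results exist, but need human review
    if semantic_hits:
        return "naming_issue"          # semantic rescue: abbreviation/spelling, not a gap
    return "gap_candidate"             # both tiers failed


print(classify_search_outcome([{"domain_id": "Condition"}], []))   # mapped
print(classify_search_outcome([{"domain_id": "Procedure"}], []))   # wrong_domain
print(classify_search_outcome([], [{"domain_id": "Condition"}]))   # naming_issue
print(classify_search_outcome([], []))                             # gap_candidate
```

The gap-candidate label is deliberately the last resort: a term is only flagged when every cheaper explanation (exact match, wrong domain, naming variation) has been ruled out.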

3. Use Case A: Automated Gap Detection for Local Codes

A hospital joining an OMOP research network has 200 unique local diagnosis codes related to sepsis. The data engineer needs to identify which ones have no standard OMOP equivalent - those are the gaps to review manually and potentially contribute to OHDSI.
pip install omophub
Python
import omophub

client = omophub.OMOPHub()

# Local codes to analyze (in production: extracted from your source system)
local_codes = [
    {"code": "Sepsis_OD", "display": "Sepsis with Organ Dysfunction"},
    {"code": "Bacteremia_NOS", "display": "Bacteremia, unspecified"},
    {"code": "Hypotension_Sep", "display": "Hypotension due to Sepsis"},
    {"code": "BiomarkerXYZ", "display": "Novel Sepsis Biomarker XYZ"},
    {"code": "ICD_A41.9", "display": "Sepsis, unspecified organism"},
    {"code": "AtypSepsis", "display": "Atypical Sepsis Presentation"},
    {"code": "qSOFA_Pos", "display": "Positive qSOFA Score"},
]

print("Automated Gap Analysis\n")

results = []

for entry in local_codes:
    code = entry["code"]
    display = entry["display"]

    match_found = False
    match_tier = None
    match_concept = None

    # --- Tier 1: Basic search ---
    try:
        basic = client.search.basic(
            display,
            vocabulary_ids=["SNOMED", "ICD10CM"],
            domain_ids=["Condition"],
            page_size=3,
        )
        candidates = basic.get("concepts", []) if basic else []

        # Keep only standard concepts - non-standard hits are not valid mapping targets
        standard = [c for c in candidates if c.get("standard_concept") == "S"]

        if standard:
            match_found = True
            match_tier = "basic"
            match_concept = standard[0]
    except omophub.APIError:
        pass

    # --- Tier 2: Semantic search (if basic failed) ---
    if not match_found:
        try:
            semantic = client.search.semantic(
                display,
                domain_ids=["Condition"],
                standard_concept="S",
                page_size=3,
            )
            candidates = (semantic.get("results", semantic.get("concepts", [])) if semantic else [])

            if candidates:
                match_found = True
                match_tier = "semantic"
                match_concept = candidates[0]
        except omophub.APIError:
            pass

    # --- Record result ---
    if match_found:
        name = match_concept.get("concept_name", "Unknown")
        cid = match_concept["concept_id"]
        vocab = match_concept.get("vocabulary_id", "N/A")
        print(f"  MAPPED [{match_tier}]: '{display}' -> {name} ({vocab}: {cid})")
        results.append({
            "local_code": code,
            "local_display": display,
            "status": "mapped",
            "match_tier": match_tier,
            "omop_concept_id": cid,
            "omop_concept_name": name,
            "omop_vocabulary": vocab,
        })
    else:
        print(f"  GAP CANDIDATE: '{display}' - no standard match found")
        results.append({
            "local_code": code,
            "local_display": display,
            "status": "gap_candidate",
            "match_tier": None,
            "omop_concept_id": None,
            "omop_concept_name": None,
            "omop_vocabulary": None,
        })

# --- Summary ---
mapped = [r for r in results if r["status"] == "mapped"]
gaps = [r for r in results if r["status"] == "gap_candidate"]

print("\n--- Gap Analysis Summary ---")
print(f"  Total codes: {len(results)}")
print(f"  Mapped: {len(mapped)} ({len([m for m in mapped if m['match_tier'] == 'basic'])} basic, {len([m for m in mapped if m['match_tier'] == 'semantic'])} semantic)")
print(f"  Gap candidates: {len(gaps)}")

if gaps:
    print("\n--- Gap Candidates for Human Review ---")
    for g in gaps:
        print(f"  Code: {g['local_code']:20s}  Display: {g['local_display']}")
The Key Insight: “Novel Sepsis Biomarker XYZ” and “Positive qSOFA Score” will likely fail both search tiers - the first because it’s genuinely novel, the second because clinical scoring concepts are sparsely represented in some vocabularies. “Sepsis, unspecified organism” will map (it’s essentially ICD-10 A41.9). “Bacteremia, unspecified” will likely map to a SNOMED concept. The gap candidates are the shortlist that needs human expert review.
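One way to hand that shortlist to reviewers is a plain CSV they can open in a spreadsheet. This sketch assumes the `results` list of dicts produced by the loop above (shown here with two hypothetical rows, and an in-memory buffer in place of a file):

```python
import csv
import io

# Hypothetical rows in the shape produced by the gap-analysis loop above
results = [
    {"local_code": "ICD_A41.9", "local_display": "Sepsis, unspecified organism",
     "status": "mapped", "match_tier": "basic"},
    {"local_code": "BiomarkerXYZ", "local_display": "Novel Sepsis Biomarker XYZ",
     "status": "gap_candidate", "match_tier": None},
]

# Only the gap candidates go to human review
gaps = [r for r in results if r["status"] == "gap_candidate"]

# In practice, replace the StringIO buffer with open("gap_shortlist.csv", "w", newline="")
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["local_code", "local_display", "status", "match_tier"])
writer.writeheader()
writer.writerows(gaps)

print(buf.getvalue())
```

Filtering before writing keeps the reviewers' file down to the handful of rows that actually need clinical judgment.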

4. Use Case B: Categorizing Gaps for Targeted Action

Not all gaps are the same. A gap candidate could be:
  • A genuinely missing concept - the clinical idea doesn’t exist in any standard vocabulary (e.g., a brand-new biomarker)
  • A missing mapping - the concept exists in SNOMED but not in the vocabulary you searched (e.g., exists as a Procedure, not a Condition)
  • A local abbreviation - the term is too institution-specific for search to match, but the concept exists under a different name
  • A composite concept - the local code combines multiple clinical ideas that are separate concepts in OMOP
Categorizing gaps helps prioritize action: missing concepts should be proposed to OHDSI; missing mappings might just need a broader search; abbreviations need local mapping work; composites need decomposition.
Python
import omophub

client = omophub.OMOPHub()

def categorize_gap(display_term):
    """
    Attempt to categorize a gap candidate by searching more broadly.
    Returns a category and any near-miss candidates found.
    """

    near_misses = []

    # Step 1: Broader search - drop domain filter, search all vocabularies
    try:
        broad = client.search.basic(display_term, page_size=5)
        candidates = broad.get("concepts", []) if broad else []

        if candidates:
            # Found something, but not in our original domain/vocab filter
            near_misses = [
                {
                    "concept_name": c.get("concept_name"),
                    "concept_id": c["concept_id"],
                    "domain": c.get("domain_id"),
                    "vocabulary": c.get("vocabulary_id"),
                    "standard": c.get("standard_concept"),
                }
                for c in candidates[:3]
            ]

            # Check if matches are in a different domain
            domains = set(c.get("domain_id") for c in candidates)
            if domains and "Condition" not in domains:
                return "wrong_domain", near_misses

            # Check if matches are non-standard (mapping exists but not standard)
            standards = [c.get("standard_concept") for c in candidates]
            if all(s != "S" for s in standards):
                return "non_standard_only", near_misses

            return "partial_match", near_misses
    except omophub.APIError:
        pass

    # Step 2: Try semantic search as last resort
    try:
        semantic = client.search.semantic(display_term)
        sem_list = []
        if semantic:
            sem_list = (
                semantic if isinstance(semantic, list)
                else semantic.get("concepts", [])
            )

        if sem_list:
            near_misses = [
                {
                    "concept_name": c.get("concept_name"),
                    "concept_id": c["concept_id"],
                    "domain": c.get("domain_id"),
                    "vocabulary": c.get("vocabulary_id"),
                }
                for c in sem_list[:3]
            ]
            return "semantic_near_miss", near_misses
    except omophub.APIError:
        pass

    # Nothing found anywhere
    return "genuinely_missing", []


# Categorize the gap candidates from Use Case A
gap_candidates = [
    "Novel Sepsis Biomarker XYZ",
    "Positive qSOFA Score",
    "Atypical Sepsis Presentation",
]

print("Categorizing gap candidates...\n")

# Recommended action for each gap category
actions = {
    "genuinely_missing": "-> Propose new concept to OHDSI Vocabulary Team via forums.ohdsi.org",
    "wrong_domain": "-> Concept exists but in a different domain. Review whether the domain mapping is correct.",
    "non_standard_only": "-> Non-standard concept exists. Check whether a standard equivalent was missed.",
    "partial_match": "-> Partial matches found. Human review needed to select the best match.",
    "semantic_near_miss": "-> Semantic match only. May be a naming/abbreviation issue.",
}

for term in gap_candidates:
    category, near_misses = categorize_gap(term)
    print(f"  '{term}'")
    print(f"    Category: {category}")

    if near_misses:
        print("    Near misses:")
        for nm in near_misses:
            print(f"      - {nm['concept_name']} ({nm['domain']}/{nm['vocabulary']})")

    print(f"    Action: {actions.get(category, 'Unknown')}")
    print()
The Key Insight: “Novel Sepsis Biomarker XYZ” will likely categorize as genuinely_missing - no matches anywhere. That’s your OHDSI contribution candidate. “Positive qSOFA Score” might find semantic_near_miss results (qSOFA-related SNOMED concepts exist but may not match the exact phrasing). “Atypical Sepsis Presentation” might get partial_match - “Sepsis” matches, but “Atypical” is too vague for a specific concept. Each category drives a different action.

5. The OHDSI Contribution Pathway

Once you’ve identified genuinely missing concepts, the contribution process is community-driven, not API-driven:
  1. Prepare a gap report from your analysis (the output of Use Cases A + B)
  2. Post to OHDSI Forums (forums.ohdsi.org → Vocabulary category) describing:
    • The missing concept and its clinical definition
    • The source vocabulary where it originates (if applicable)
    • How many records in your data use this code (impact/frequency)
    • Any near-miss concepts from your OMOPHub search (shows you did due diligence)
  3. OHDSI Vocabulary Team reviews the proposal and decides whether to add it
  4. If accepted, it appears in the next Athena vocabulary release
  5. Update your local vocabularies from Athena to get the new concept
OMOPHub helps with step 1 (automated gap detection and categorization). Steps 2-5 are a community process. This is by design - vocabulary changes affect the entire OHDSI network and need human review. Tools that complement this workflow:
  • USAGI - OHDSI’s mapping tool for the manual review step (reviewing near-misses, approving mappings)
  • Athena - Where accepted contributions land (vocabulary downloads)
  • OMOPHub - Fast vocabulary search for gap detection (no local vocab DB needed)
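For step 1, a small formatter can turn categorized gaps into forum-ready text covering the points above (definition, impact, near misses). The field names mirror the dicts built in Use Cases A and B, but the function and its output format are purely illustrative:

```python
def format_gap_report(site, gaps):
    """Render categorized gap candidates as a plain-text report for a forum post.

    gaps: list of dicts with local_code, local_display, category,
    record_count (impact/frequency), and near_misses (concept-name strings).
    """
    lines = [f"Vocabulary gap report - {site}", ""]
    for g in gaps:
        lines.append(f"* {g['local_display']} (local code: {g['local_code']})")
        lines.append(f"  Category: {g['category']}, records affected: {g['record_count']}")
        if g["near_misses"]:
            # Listing near misses shows the Vocabulary Team you did due diligence
            lines.append("  Near misses checked: " + ", ".join(g["near_misses"]))
        else:
            lines.append("  Near misses checked: none found")
    return "\n".join(lines)


report = format_gap_report("Example Hospital", [
    {"local_code": "BiomarkerXYZ", "local_display": "Novel Sepsis Biomarker XYZ",
     "category": "genuinely_missing", "record_count": 412, "near_misses": []},
])
print(report)
```

Including the record count per gap lets the Vocabulary Team weigh impact, and the near-miss list preempts the most common reviewer question: "did you check for an existing concept?"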

6. Conclusion: Finding the Gaps That Matter

Gap analysis isn’t about finding every unmapped code - most unmapped codes are just local abbreviations that need manual mapping work. It’s about finding the codes that represent genuinely missing clinical concepts in the standard vocabularies. Those are the high-value contributions to OHDSI.

The tiered search approach (basic → semantic → flag) combined with gap categorization (genuinely missing vs. wrong domain vs. naming issue) produces a focused shortlist that expert reviewers can act on efficiently. Instead of reviewing 3,000 local codes, they review 50 gap candidates, 10 of which might be genuine OHDSI contributions.

OMOPHub makes the detection fast. The categorization helps prioritize. The OHDSI community process handles the rest.

Start with your most problematic source system - the one with the most unmapped codes. Run the gap analysis. Categorize the results. Post the genuinely missing concepts to the OHDSI forums. Your contribution makes the next institution’s mapping work a little easier.