
1. The “Hidden Code” Problem

You’d think defining “Type 2 Diabetes” in a database would be simple. Look for ICD-10 code E11. Done, right? Not even close.

In OMOP CDM data, Type 2 Diabetes is a sprawling web of concepts: the top-level SNOMED code, dozens of specific subtypes (“Type 2 DM with renal complications,” “Type 2 DM with peripheral angiopathy”), equivalent ICD-10 codes, related lab measurements (HbA1c in LOINC), medications (Metformin, insulin glargine in RxNorm), and procedures. Miss even one obscure code and your research cohort is undercounted. One major study found their diabetes phenotype captured 23% fewer patients when they used a narrow concept set versus a properly expanded one.

This is the omission bias problem - and it’s endemic in observational research. The codes are there in the vocabulary. You just can’t find them all manually.

Concept set expansion - the process of systematically identifying every relevant concept ID for your phenotype - is the solution. It works in two dimensions:
  • Vertical (hierarchical): Follow the parent-child tree. “Type 2 Diabetes Mellitus” has dozens of SNOMED descendants. You want all of them.
  • Horizontal (cross-vocabulary): Map your SNOMED concepts to their ICD10CM, LOINC, and RxNorm equivalents, so your phenotype catches patients regardless of which coding system their data uses.
OMOPHub is a vocabulary API that makes both of these fast. Its hierarchy API traverses ancestor/descendant relationships in a single call. Its mappings API resolves concepts across vocabularies. And its search API (including fuzzy and semantic search) helps you discover concepts you didn’t know to look for. No local vocabulary database required - just API calls against the full OHDSI ATHENA vocabulary.
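
To make the two dimensions concrete, here is a minimal sketch on a toy in-memory vocabulary. It makes no API calls. The concept IDs, the `CHILDREN` and `ICD10CM_MAP` tables, and both helper functions are made up for illustration; they are not OMOPHub methods or real OMOP concept IDs, and the code-to-concept links are illustrative rather than authoritative mappings.

```python
from collections import deque

# Toy vocabulary: hypothetical concept IDs, not real OMOP IDs.
NAMES = {
    1: "Type 2 diabetes mellitus",
    2: "Type 2 DM with renal complications",
    3: "Type 2 DM with peripheral angiopathy",
    4: "Type 2 DM with diabetic nephropathy",
}
# Parent -> children (the "Is a" hierarchy, read top-down).
CHILDREN = {1: [2, 3], 2: [4]}
# Illustrative cross-vocabulary links from a concept to ICD-10-CM codes.
ICD10CM_MAP = {1: ["E11.9"], 2: ["E11.29"], 3: ["E11.51"], 4: ["E11.21"]}


def expand_vertical(seed_id, max_levels=None):
    """Breadth-first walk of the hierarchy; returns the seed plus its descendants."""
    found = {seed_id}
    queue = deque([(seed_id, 0)])
    while queue:
        concept_id, level = queue.popleft()
        if max_levels is not None and level >= max_levels:
            continue  # Depth limit reached: do not descend further
        for child in CHILDREN.get(concept_id, []):
            if child not in found:
                found.add(child)
                queue.append((child, level + 1))
    return found


def expand_horizontal(concept_ids):
    """Collect every mapped ICD-10-CM code for a set of concepts."""
    return {code for cid in concept_ids for code in ICD10CM_MAP.get(cid, [])}


narrow = expand_horizontal({1})
full = expand_horizontal(expand_vertical(1))
print(f"Seed only: {sorted(narrow)}")
print(f"Expanded:  {sorted(full)}")
```

The gap between `narrow` and `full` is the omission bias in miniature: mapping only the seed yields one code, while expanding vertically first and then mapping yields four.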

2. The Core Concept: Hierarchical vs. Semantic Expansion

When building a concept set, you’re doing three kinds of work:

Hierarchical expansion follows the vocabulary’s built-in structure. In SNOMED, “Atrial Fibrillation” has child concepts like “Paroxysmal atrial fibrillation,” “Persistent atrial fibrillation,” and “Chronic atrial fibrillation.” OMOPHub’s hierarchy.descendants() walks this tree for you - specify a concept ID, set the depth, and get back every descendant within that depth. This is the “vertical” dimension: going deeper into specificity within a single vocabulary.

Cross-vocabulary mapping is the “horizontal” dimension. Your phenotype might be defined in SNOMED, but your billing data uses ICD-10-CM. OMOPHub’s mappings.get() follows the mapping relationships between vocabularies (in OMOP, “Maps to” runs from a source code such as an ICD-10-CM code to its standard concept, and “Mapped from” is the reverse), ensuring your phenotype works regardless of how the data was originally coded.

Semantic search adds a discovery layer. If you search OMOPHub for “atrial fibrillation,” the semantic and fuzzy search methods can surface related concepts you might not have thought to include - terms worded differently, abbreviations, or clinically adjacent concepts that share a vocabulary neighborhood. This isn’t AI “understanding” the clinical meaning; it’s smart text matching against a comprehensive vocabulary index. But it’s remarkably effective at catching concepts that pure hierarchy traversal would miss.

OMOPHub handles the vocabulary lookup. The clinical judgment - deciding which concepts belong in your phenotype and which don’t - still belongs to you.
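
To see why text matching surfaces concepts that hierarchy traversal misses, here is a deliberately simple stand-in: a token-overlap score against a small list of concept names. This is an assumption-laden sketch, not OMOPHub’s actual search algorithm; the `tokens` and `discover` helpers and the four-entry index are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def discover(query, concept_names, min_overlap=0.5):
    """Return (name, score) pairs whose share of matched query tokens meets the threshold."""
    query_tokens = tokens(query)
    if not query_tokens:
        return []
    hits = []
    for name in concept_names:
        score = len(query_tokens & tokens(name)) / len(query_tokens)
        if score >= min_overlap:
            hits.append((name, score))
    # Highest-scoring names first; ties keep their input order (sort is stable).
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical vocabulary index, for illustration only.
index = [
    "Paroxysmal atrial fibrillation",
    "Atrial flutter",
    "Fibrillation, atrial",
    "Type 2 diabetes mellitus",
]

for name, score in discover("atrial fibrillation", index):
    print(f"{score:.2f}  {name}")
```

Note what comes back: a reworded term (“Fibrillation, atrial”) and a clinically adjacent one (“Atrial flutter”) that would not appear under the atrial fibrillation subtree. Whether the adjacent concept belongs in the phenotype is exactly the clinical judgment call that stays with you.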

3. Use Case A: Rapid Concept Set Bootstrapping (From a Single Seed)

A researcher starts with a single concept and needs a comprehensive concept set. Here’s how to bootstrap it.

The Scenario: You’re building an Atrial Fibrillation phenotype. You know the SNOMED concept name but need to find (a) all its SNOMED descendants and (b) related concepts you might be missing.

Code Snippet: Expanding a Concept Set from a Seed
pip install omophub
Python
import omophub

client = omophub.OMOPHub()

# Step 1: Find the seed concept's OMOP concept ID
# IMPORTANT: OMOPHub uses OMOP concept IDs, not SNOMED codes directly.
# SNOMED code 49436004 = "Atrial fibrillation", but the OMOP concept ID is different.
# Always look it up first.

seed_search = client.search.basic(
    "Atrial fibrillation",
    vocabulary_ids=["SNOMED"],
    domain_ids=["Condition"],
    page_size=3,
)

seed_candidates = seed_search.get("concepts", []) if seed_search else []

if not seed_candidates:
    print("Could not find seed concept. Check spelling or vocabulary.")
else:
    seed = seed_candidates[0]
    seed_id = seed["concept_id"]
    seed_name = seed.get("concept_name", "Unknown")
    seed_code = seed.get("concept_code", "N/A")
    print(f"Seed: {seed_name} (OMOP ID: {seed_id}, SNOMED: {seed_code})\n")

    # Collect all expanded concepts
    expanded = {seed_id: seed_name}

    # Step 2: Hierarchical expansion - get all descendants
    print("--- Hierarchical Expansion (Descendants) ---")
    try:
        descendants = client.hierarchy.descendants(
            seed_id,
            max_levels=5,
            relationship_types=["Is a"],
        )
        desc_list = (
            descendants if isinstance(descendants, list)
            else descendants.get("concepts", [])
        )

        for desc in desc_list:
            d_id = desc["concept_id"]
            d_name = desc.get("concept_name", "Unknown")
            expanded[d_id] = d_name
            print(f"  - {d_name} (ID: {d_id})")

        if not desc_list:
            print("  No descendants found.")

    except omophub.APIError as e:
        print(f"  Hierarchy API error: {e.message}")

    # Step 3: Semantic search - discover related concepts
    # This catches concepts that aren't hierarchical children but are
    # clinically adjacent (e.g., "Atrial flutter" near "Atrial fibrillation")
    print("\n--- Semantic Search (Discovery) ---")
    try:
        semantic_results = client.search.semantic("atrial fibrillation related arrhythmia")
        sem_list = (
            semantic_results if isinstance(semantic_results, list)
            else semantic_results.get("concepts", [])
        ) if semantic_results else []

        # Filter to Condition domain, exclude concepts already found
        new_finds = 0
        for concept in sem_list:
            c_id = concept.get("concept_id")
            if c_id and c_id not in expanded and concept.get("domain_id") == "Condition":
                c_name = concept.get("concept_name", "Unknown")
                c_vocab = concept.get("vocabulary_id", "N/A")
                expanded[c_id] = c_name
                new_finds += 1
                print(f"  - {c_name} (ID: {c_id}, Vocab: {c_vocab})")

        if new_finds == 0:
            print("  No additional concepts found via semantic search.")

    except omophub.APIError as e:
        print(f"  Semantic search error: {e.message}")

    # Summary
    print("\n--- Expanded Concept Set ---")
    print(f"Total concepts: {len(expanded)} (1 seed + {len(expanded) - 1} expanded)")
    print(f"First 10: {list(expanded.values())[:10]}")
The Key Insight: The hierarchy traversal is the workhorse - it systematically captures every subtype. The semantic search is the scout - it surfaces concepts in adjacent branches or different vocabularies that hierarchy alone would miss. Together, they bootstrap a concept set in seconds that would take hours of manual ATHENA browsing.
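
Because the workhorse and the scout differ in reliability, it can help to record where each concept came from before anyone reviews the set. The sketch below is one possible way to do that. It assumes records shaped like the `concept_id` / `concept_name` dicts used in the snippet above; the helper names and the sample IDs are hypothetical.

```python
def merge_with_provenance(seed, descendants, semantic_hits):
    """Merge concept lists into one dict keyed by concept_id, recording the first source."""
    merged = {}
    sources = (
        ("seed", [seed]),
        ("hierarchy", descendants),
        ("semantic", semantic_hits),
    )
    for source, concepts in sources:
        for concept in concepts:
            # setdefault keeps the earliest source, so hierarchy wins over semantic.
            merged.setdefault(
                concept["concept_id"],
                {"name": concept["concept_name"], "source": source},
            )
    return merged


def needs_review(merged):
    """Names of concepts found only by semantic search, which warrant a closer look."""
    return sorted(v["name"] for v in merged.values() if v["source"] == "semantic")


# Hypothetical records; the IDs are made up, not real OMOP concept IDs.
seed = {"concept_id": 100, "concept_name": "Atrial fibrillation"}
descendants = [
    {"concept_id": 101, "concept_name": "Paroxysmal atrial fibrillation"},
    {"concept_id": 102, "concept_name": "Persistent atrial fibrillation"},
]
semantic_hits = [
    {"concept_id": 101, "concept_name": "Paroxysmal atrial fibrillation"},
    {"concept_id": 200, "concept_name": "Atrial flutter"},
]

merged = merge_with_provenance(seed, descendants, semantic_hits)
print(f"{len(merged)} concepts, review first: {needs_review(merged)}")
```

Descendants follow from the vocabulary’s own structure, so they are usually safe to accept in bulk. Semantic-only finds are suggestions, and tagging them makes it easy to route them to a clinician first.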

4. Use Case B: Cross-Vocabulary Mapping for Phenotype Validation

Your SNOMED-based phenotype needs to work against ICD-10-CM billing data. If there’s no mapping, you’re leaking patients.

The Scenario: You have four SNOMED concepts defining a cardiovascular-renal phenotype. You need their ICD-10-CM equivalents to validate against billing claims.

Code Snippet: Cross-Vocabulary Mapping
Python
import omophub

client = omophub.OMOPHub()

# Core phenotype concepts, listed by name rather than by hardcoded ID.
# Each one is resolved to its OMOP concept ID via search in the loop below.
phenotype_concepts = [
    {"name": "Atrial fibrillation", "search_term": "Atrial fibrillation"},
    {"name": "Myocardial infarction", "search_term": "Myocardial infarction"},
    {"name": "Hypertensive disorder", "search_term": "Hypertensive disorder"},
    {"name": "Chronic kidney disease", "search_term": "Chronic kidney disease"},
]

print("Cross-Vocabulary Mapping: SNOMED -> ICD-10-CM\n")

all_icd10_codes = set()

for entry in phenotype_concepts:
    # Step 1: Resolve to OMOP concept ID via search
    try:
        results = client.search.basic(
            entry["search_term"],
            vocabulary_ids=["SNOMED"],
            domain_ids=["Condition"],
            page_size=1,
        )
        candidates = results.get("concepts", []) if results else []

        if not candidates:
            print(f"  {entry['name']}: No SNOMED match found. Skipping.")
            continue

        concept = candidates[0]
        omop_id = concept["concept_id"]
        snomed_code = concept.get("concept_code", "N/A")
        print(f"  {concept.get('concept_name', entry['name'])} (OMOP: {omop_id}, SNOMED: {snomed_code})")

        # Step 2: Get ICD-10-CM mappings
        mappings = client.mappings.get(omop_id, target_vocabulary="ICD10CM")
        mapping_list = (
            mappings if isinstance(mappings, list)
            else mappings.get("concepts", mappings.get("mappings", []))
        ) if mappings else []

        if mapping_list:
            for m in mapping_list:
                icd_name = m.get("concept_name", "N/A")
                icd_code = m.get("concept_code")
                if icd_code:  # Skip records without a code so "N/A" never enters the set
                    all_icd10_codes.add(icd_code)
                print(f"    -> ICD-10-CM: {icd_name} ({icd_code or 'N/A'})")
        else:
            print("    -> No direct ICD-10-CM mapping found")

    except omophub.APIError as e:
        print(f"  {entry['name']}: API error - {e.message}")

    print()

print(f"Total unique ICD-10-CM codes for phenotype: {len(all_icd10_codes)}")
print(f"Codes: {sorted(all_icd10_codes)}")
The Key Insight: This is the “leak check” for your phenotype. If your study database uses ICD-10-CM for billing data and your phenotype is defined only in SNOMED, you’ll miss every patient who was coded only in ICD-10. By mapping each SNOMED concept to its ICD-10-CM equivalents, you ensure the phenotype captures patients regardless of the coding system - and you can spot concepts with no cross-vocabulary mapping that might need manual attention.
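
The leak check itself can be reduced to a small report over whatever mappings you collected. The sketch below assumes you have gathered results into a plain dict of concept name to list of ICD-10-CM codes. The `leak_report` helper is hypothetical, and the sample data is contrived: the empty list for chronic kidney disease is there only to show the unmapped case, not to suggest that no mapping exists.

```python
def leak_report(mappings_by_concept):
    """Split concepts into mapped and unmapped, and collect the unique target codes."""
    unmapped = sorted(
        name for name, codes in mappings_by_concept.items() if not codes
    )
    all_codes = sorted(
        {code for codes in mappings_by_concept.values() for code in codes}
    )
    total = len(mappings_by_concept)
    coverage = (total - len(unmapped)) / total if total else 0.0
    return {"unmapped": unmapped, "codes": all_codes, "coverage": coverage}


# Hypothetical mapping results, for illustration only.
results = {
    "Atrial fibrillation": ["I48.91"],
    "Myocardial infarction": ["I21.9"],
    "Hypertensive disorder": ["I10"],
    "Chronic kidney disease": [],  # Contrived gap to demonstrate the report
}

report = leak_report(results)
print(f"Coverage: {report['coverage']:.0%}")
print(f"Needs manual attention: {report['unmapped']}")
```

Anything in `unmapped` is a candidate leak: either the mapping genuinely does not exist, or the lookup needs a different seed concept or a descendant-level search.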

5. Exploring Concept Relationships for Phenotype Refinement

Beyond hierarchy and cross-vocabulary mapping, OMOP concepts have rich relationships to each other: “Has associated finding,” “Has causative agent,” “Occurs after,” and more. Traversing these relationships helps you discover concepts that are clinically relevant but outside the hierarchical tree.

The Idea: For a seed concept, explore its relationships to find clinically related concepts in other domains. A condition might have associated measurements (LOINC), treatments (RxNorm), or complications (SNOMED) linked through OMOP relationships.
Python
import omophub

client = omophub.OMOPHub()

# Find the seed concept (Type 2 Diabetes Mellitus)
seed_results = client.search.basic(
    "Type 2 diabetes mellitus",
    vocabulary_ids=["SNOMED"],
    domain_ids=["Condition"],
    page_size=1,
)

if seed_results and seed_results.get("concepts"):
    seed = seed_results["concepts"][0]
    seed_id = seed["concept_id"]
    print(f"Seed: {seed.get('concept_name')} (ID: {seed_id})\n")

    # Get all relationships for this concept
    try:
        relationships = client.concepts.relationships(seed_id)
        rel_list = (
            relationships if isinstance(relationships, list)
            else relationships.get("relationships", [])
        ) if relationships else []

        print(f"Found {len(rel_list)} relationships:\n")

        # Group by relationship type for clarity
        by_type = {}
        for rel in rel_list:
            rel_type = rel.get("relationship_id", "Unknown")
            if rel_type not in by_type:
                by_type[rel_type] = []
            by_type[rel_type].append(rel)

        for rel_type, rels in sorted(by_type.items()):
            print(f"  {rel_type} ({len(rels)} concepts):")
            for rel in rels[:5]:  # Show up to 5 per type
                r_name = rel.get("concept_name", "Unknown")
                r_id = rel.get("concept_id", "N/A")
                r_vocab = rel.get("vocabulary_id", "N/A")
                print(f"    - {r_name} (ID: {r_id}, Vocab: {r_vocab})")
            if len(rels) > 5:
                print(f"    ... and {len(rels) - 5} more")
            print()

    except omophub.APIError as e:
        print(f"Relationship lookup error: {e.message}")
else:
    print("Could not find seed concept.")
The Key Insight: OMOP’s relationship graph is an underused goldmine for phenotype development. A single concept can have dozens of relationships to associated findings, complications, and treatments. Traversing these relationships - rather than relying solely on hierarchy - helps build multi-domain phenotypes that capture the full clinical picture. For example, a diabetes phenotype should include not just condition codes, but also HbA1c measurements (LOINC) and antidiabetic medications (RxNorm). OMOPHub’s concept relationship API makes this exploration programmatic.

A note on Phoebe: The OHDSI community has developed the Phoebe algorithm, which recommends concepts based on co-occurrence patterns in real-world OMOP data. If OMOPHub exposes Phoebe functionality (check their latest documentation), it would complement the relationship-based exploration shown above with data-driven recommendations. Phoebe is particularly valuable for identifying concepts that are empirically associated with your phenotype but not linked through formal vocabulary relationships.
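
Once related concepts are collected, a multi-domain phenotype is easier to reason about when the concepts are bucketed by domain. The sketch below shows one way to do that. It assumes each record carries a `domain_id` field, which the relationship snippet above does not confirm for relationship results, so treat that field as an assumption. The helper name and sample IDs are hypothetical.

```python
from collections import defaultdict


def group_by_domain(related_concepts):
    """Bucket related concepts by domain_id, de-duplicating on concept_id."""
    buckets = defaultdict(dict)
    for concept in related_concepts:
        domain = concept.get("domain_id", "Unknown")
        buckets[domain][concept["concept_id"]] = concept["concept_name"]
    # Return plain sorted name lists so the result is easy to print or compare.
    return {domain: sorted(names.values()) for domain, names in buckets.items()}


# Hypothetical related concepts; the IDs are made up, not real OMOP concept IDs.
related = [
    {"concept_id": 1, "concept_name": "Type 2 diabetes mellitus", "domain_id": "Condition"},
    {"concept_id": 2, "concept_name": "Hemoglobin A1c measurement", "domain_id": "Measurement"},
    {"concept_id": 3, "concept_name": "Metformin", "domain_id": "Drug"},
    {"concept_id": 4, "concept_name": "Insulin glargine", "domain_id": "Drug"},
    {"concept_id": 3, "concept_name": "Metformin", "domain_id": "Drug"},  # Duplicate on purpose
]

phenotype = group_by_domain(related)
for domain, names in phenotype.items():
    print(f"{domain}: {names}")
```

Each bucket then maps naturally onto a different part of a cohort definition: conditions, measurements, and drug exposures are typically queried from different tables.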

6. Conclusion: Building Better Cohorts, Faster

Phenotype quality determines research quality. An incomplete concept set means an incomplete cohort, which means biased results. The traditional approach - manually browsing ATHENA, relying on expert intuition, hoping you haven’t missed a code - doesn’t scale.

OMOPHub makes the vocabulary mechanics fast: hierarchy traversal to capture all subtypes, cross-vocabulary mapping to ensure portability, semantic search to discover concepts you didn’t know to look for, and relationship exploration to build multi-domain phenotypes. What used to take days of manual vocabulary work becomes a set of API calls.

The clinical judgment - deciding which concepts belong in your phenotype and which don’t - still requires human expertise. But OMOPHub gives that expertise a comprehensive starting set to work with, rather than a handful of codes pulled from memory.

Start with your favorite condition. Search for it, expand its descendants, map it to ICD-10-CM, and explore its relationships. You’ll almost certainly discover concepts you would have missed manually. That’s the difference between a phenotype that works and one that leaks.