1. The “Hidden Code” Problem
You’d think defining “Type 2 Diabetes” in a database would be simple. Look for ICD-10 code E11. Done, right? Not even close. In OMOP CDM data, Type 2 Diabetes is a sprawling web of concepts: the top-level SNOMED code, dozens of specific subtypes (“Type 2 DM with renal complications,” “Type 2 DM with peripheral angiopathy”), equivalent ICD-10 codes, related lab measurements (HbA1c in LOINC), medications (Metformin, insulin glargine in RxNorm), and procedures. Miss even one obscure code and your research cohort is undercounted. One major study found their diabetes phenotype captured 23% fewer patients when they used a narrow concept set versus a properly expanded one. This is the omission bias problem - and it’s endemic in observational research. The codes are there in the vocabulary. You just can’t find them all manually. Concept set expansion - the process of systematically identifying every relevant concept ID for your phenotype - is the solution. It works in two dimensions:- Vertical (hierarchical): Follow the parent-child tree. “Type 2 Diabetes Mellitus” has dozens of SNOMED descendants. You want all of them.
- Horizontal (cross-vocabulary): Map your SNOMED concepts to their ICD10CM, LOINC, and RxNorm equivalents, so your phenotype catches patients regardless of which coding system their data uses.
2. The Core Concept: Hierarchical vs. Semantic Expansion
When building a concept set, you’re doing two kinds of work: Hierarchical expansion follows the vocabulary’s built-in structure. In SNOMED, “Atrial Fibrillation” has child concepts like “Paroxysmal atrial fibrillation,” “Persistent atrial fibrillation,” and “Chronic atrial fibrillation.” OMOPHub’shierarchy.descendants() walks this tree for you - specify a concept ID, set the depth, and get back every descendant. This is the “vertical” dimension: going deeper into specificity within a single vocabulary.
Cross-vocabulary mapping is the “horizontal” dimension. Your phenotype might be defined in SNOMED, but your billing data uses ICD-10-CM. OMOPHub’s mappings.get() finds the “Maps to” relationships between vocabularies, ensuring your phenotype works regardless of how the data was originally coded.
Semantic search adds a discovery layer. If you search OMOPHub for “atrial fibrillation,” the semantic and fuzzy search methods can surface related concepts you might not have thought to include - terms worded differently, abbreviations, or clinically adjacent concepts that share a vocabulary neighborhood. This isn’t AI “understanding” the clinical meaning; it’s smart text matching against a comprehensive vocabulary index. But it’s remarkably effective at catching concepts that pure hierarchy traversal would miss.
OMOPHub handles the vocabulary lookup. The clinical judgment - deciding which concepts belong in your phenotype and which don’t - still belongs to you.
3. Use Case A: Rapid Concept Set Bootstrapping (From a Single Seed)
A researcher starts with a single concept and needs a comprehensive concept set. Here’s how to bootstrap it. The Scenario: You’re building an Atrial Fibrillation phenotype. You know the SNOMED concept name but need to find (a) all its SNOMED descendants and (b) related concepts you might be missing. Code Snippet: Expanding a Concept Set from a SeedPython
4. Use Case B: Cross-Vocabulary Mapping for Phenotype Validation
Your SNOMED-based phenotype needs to work against ICD-10-CM billing data. If there’s no mapping, you’re leaking patients. The Scenario: You have four SNOMED concepts defining a cardiovascular phenotype. You need their ICD-10-CM equivalents to validate against billing claims. Code Snippet: Cross-Vocabulary MappingPython
5. Exploring Concept Relationships for Phenotype Refinement
Beyond hierarchy and cross-vocabulary mapping, OMOP concepts have rich relationships to each other: “Has associated finding,” “Has causative agent,” “Occurs after,” and more. Traversing these relationships helps you discover concepts that are clinically relevant but outside the hierarchical tree. The Idea: For a seed concept, explore its relationships to find clinically related concepts in other domains. A condition might have associated measurements (LOINC), treatments (RxNorm), or complications (SNOMED) linked through OMOP relationships.Python