
1. The “Needle in a Haystack” Problem

A Phase III diabetes trial with 200 empty slots and six months to fill them. The patients exist - their data is sitting right there in the OMOP CDM. The problem? The eligibility criteria say things like “Adults with HbA1c > 7.0% despite Metformin therapy” - and translating that into queries across condition_occurrence, measurement, and drug_exposure tables, with the right concept IDs, the right value thresholds, and the right temporal logic, takes weeks of manual work per trial.

This is the bottleneck in clinical trial recruitment. Not a lack of patients - a lack of infrastructure to match patients to trials at scale. Solving this requires two things working together: (1) an NLP or LLM system to parse the eligibility criteria text into structured components (conditions, drugs, measurements, thresholds, temporal constraints), and (2) a vocabulary API to resolve those components into standardized OMOP concept IDs that you can actually query against.

OMOPHub handles the second part. It gives you instant API access to the full OHDSI ATHENA vocabulary - SNOMED for conditions, RxNorm for drugs, LOINC for measurements - so you can resolve “Type 2 Diabetes Mellitus” to concept ID 201826, “Metformin” to its RxNorm ingredient, and “HbA1c” to its LOINC code, all without maintaining a local vocabulary database.

But OMOPHub’s real superpower for trial screening is concept set expansion. A trial criterion says “Type 2 Diabetes.” But your patients might be coded as “Type 2 DM with renal complications,” “Type 2 DM with peripheral angiopathy,” or a dozen other specific variants. Simple ID matching would miss them. OMOPHub’s hierarchy API lets you expand a single concept into all its descendants - catching every patient who should qualify, regardless of how specifically they were coded.

2. The Core Concept: From Criteria Text to OMOP Queries

Automating trial eligibility screening is a multi-step process. Here’s how the pieces fit together:

Step 1 - Parse the criteria (NLP/LLM). Take the eligibility text from the protocol and extract structured components. A criterion like “patients with Type 2 Diabetes Mellitus and HbA1c > 7.0%, currently on Metformin” decomposes into:
  • Condition: “Type 2 Diabetes Mellitus” (inclusion)
  • Measurement: “HbA1c” > 7.0% (inclusion)
  • Drug: “Metformin” (inclusion, current exposure)
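The decomposition above could be carried between pipeline steps as a small record type. The sketch below is only an illustration: the `ParsedCriterion` class and its field names are hypothetical, not part of OMOPHub or of any particular parser, and the three records are written by hand to stand in for parser output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedCriterion:
    """One structured component extracted from eligibility text (hypothetical schema)."""
    domain: str                      # "condition", "measurement", or "drug"
    term: str                        # free-text clinical term, not yet a concept ID
    inclusion: bool = True           # False for exclusion criteria
    operator: Optional[str] = None   # comparison for measurements, e.g. ">"
    value: Optional[float] = None    # numeric threshold, e.g. 7.0
    unit: Optional[str] = None       # unit of the threshold, e.g. "%"
    temporal: Optional[str] = None   # temporal constraint, e.g. "current"


# Hand-written stand-in for what an NLP/LLM parser might return for:
# "patients with Type 2 Diabetes Mellitus and HbA1c > 7.0%, currently on Metformin"
parsed = [
    ParsedCriterion(domain="condition", term="Type 2 Diabetes Mellitus"),
    ParsedCriterion(domain="measurement", term="HbA1c", operator=">", value=7.0, unit="%"),
    ParsedCriterion(domain="drug", term="Metformin", temporal="current"),
]

for criterion in parsed:
    print(f"{criterion.domain}: {criterion.term}")
```

Keeping the threshold and the temporal constraint on the record, separate from the term, matters later: only the term goes to the vocabulary lookup, while the operator, value, and temporal fields are applied in the database query.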
This step requires an NLP tool or LLM (tools like Criteria2Query, or prompting Claude/GPT with the protocol text). OMOPHub does not do this step - it’s a vocabulary API, not a text parser.

Step 2 - Resolve to OMOP concepts (OMOPHub). Take each extracted entity and look up its standardized OMOP concept ID via OMOPHub’s search API:
  • “Type 2 Diabetes Mellitus” → SNOMED concept ID 201826
  • “Metformin” → RxNorm ingredient concept ID
  • “HbA1c” → LOINC concept ID
Step 3 - Expand concept sets (OMOPHub). Use hierarchy.descendants() to build complete concept sets. “Type 2 Diabetes Mellitus” has dozens of child concepts in SNOMED. You want to match patients coded with any of them.

Step 4 - Query the OMOP database. Apply the resolved, expanded concept IDs against the patient database (condition_occurrence, drug_exposure, measurement tables) with the appropriate value/temporal filters.

OMOPHub owns Steps 2 and 3. That’s where it adds the most value - turning clinical terms into queryable concept sets without the overhead of a local vocabulary database.
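Step 4 is where the value threshold and the temporal logic get applied. The following is a minimal sketch of that query, run against an in-memory SQLite database with cut-down versions of the three OMOP tables. The column names follow the OMOP CDM, but the rows are invented, the 90000xx concept IDs are hypothetical placeholders, and "current exposure" is simplified to "an exposure interval that covers the screening date".

```python
import sqlite3

# Concept sets as they might come out of Steps 2 and 3.
# 201826 and 1503297 appear in this article; the 90000xx IDs are hypothetical.
condition_ids = [201826, 9000001]   # Type 2 DM plus one descendant
drug_ids = [1503297]                # Metformin
measurement_ids = [9000100]         # placeholder for an HbA1c concept
as_of = "2024-06-01"                # screening date used for "current" exposure

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
CREATE TABLE drug_exposure (person_id INTEGER, drug_concept_id INTEGER,
                            drug_exposure_start_date TEXT, drug_exposure_end_date TEXT);
CREATE TABLE measurement (person_id INTEGER, measurement_concept_id INTEGER,
                          value_as_number REAL, measurement_date TEXT);
""")
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?)",
                 [(1, 9000001), (2, 201826), (3, 201826), (4, 201826)])
conn.executemany("INSERT INTO drug_exposure VALUES (?, ?, ?, ?)", [
    (1, 1503297, "2024-01-01", "2024-12-31"),
    (2, 1503297, "2024-01-01", "2024-12-31"),
    (4, 1503297, "2023-01-01", "2023-06-30"),   # exposure ended before screening
])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", [
    (1, 9000100, 8.1, "2024-05-20"),
    (2, 9000100, 6.5, "2024-05-18"),            # below the 7.0 threshold
    (3, 9000100, 9.0, "2024-05-02"),            # person 3 has no Metformin exposure
    (4, 9000100, 8.0, "2024-05-11"),
])


def placeholders(ids):
    """Build a '?, ?, ...' list so concept IDs are passed as bound parameters."""
    return ", ".join("?" for _ in ids)


sql = f"""
SELECT DISTINCT c.person_id
FROM condition_occurrence c
JOIN drug_exposure d ON d.person_id = c.person_id
JOIN measurement m ON m.person_id = c.person_id
WHERE c.condition_concept_id IN ({placeholders(condition_ids)})
  AND d.drug_concept_id IN ({placeholders(drug_ids)})
  AND d.drug_exposure_start_date <= ?
  AND d.drug_exposure_end_date >= ?
  AND m.measurement_concept_id IN ({placeholders(measurement_ids)})
  AND m.value_as_number > ?
ORDER BY c.person_id
"""
# Parameter order mirrors the order of the placeholders in the query text.
params = [*condition_ids, *drug_ids, as_of, as_of, *measurement_ids, 7.0]

eligible = [row[0] for row in conn.execute(sql, params)]
print("Candidate person_ids:", eligible)
```

In this toy data only person 1 passes all three filters, and they are coded with a descendant concept rather than 201826 itself. A production query would also need to decide which measurement counts (for example, the most recent one) and handle open-ended exposures with a missing end date.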

3. Use Case A: Resolving Parsed Criteria to OMOP Concept Sets

Suppose your NLP step has already extracted the key entities from the trial protocol. Now you need OMOP concept IDs - and not just the top-level concept, but the full expanded set of descendants.

The Scenario: A trial requires patients with “Type 2 Diabetes Mellitus” and current “Metformin” use. You need concept sets for both.

Code Snippet: Resolving Criteria Entities and Expanding Concept Sets
pip install omophub
Python
import omophub

client = omophub.OMOPHub()

# These entities were extracted from the trial protocol by an NLP/LLM step
# (OMOPHub does NOT do the extraction - it resolves terms to concept IDs)
parsed_criteria = {
    "inclusion_conditions": ["Type 2 Diabetes Mellitus"],
    "inclusion_drugs": ["Metformin"],
    "inclusion_measurements": ["Hemoglobin A1c"],
}

resolved_criteria = {
    "condition_concept_ids": set(),
    "drug_concept_ids": set(),
    "measurement_concept_ids": set(),
}

print("Resolving parsed criteria to OMOP concept sets:\n")

# --- Resolve Conditions ---
for condition_name in parsed_criteria["inclusion_conditions"]:
    try:
        results = client.search.basic(
            condition_name,
            vocabulary_ids=["SNOMED"],
            domain_ids=["Condition"],
            page_size=3,
        )
        candidates = results.get("concepts", []) if results else []

        if candidates:
            top_match = candidates[0]
            top_id = top_match["concept_id"]
            print(f"  Condition: '{condition_name}'")
            print(f"    -> {top_match['concept_name']} (ID: {top_id})")

            # Expand concept set: get all descendants
            descendants = client.hierarchy.descendants(top_id, max_levels=5)
            desc_list = descendants if isinstance(descendants, list) else descendants.get("concepts", [])

            resolved_criteria["condition_concept_ids"].add(top_id)
            for desc in desc_list:
                resolved_criteria["condition_concept_ids"].add(desc["concept_id"])

            print(f"    -> Expanded to {len(resolved_criteria['condition_concept_ids'])} concepts (including descendants)")
        else:
            print(f"  Condition: '{condition_name}' -> No SNOMED match found")

    except omophub.APIError as e:
        print(f"  API error for '{condition_name}': {e.message}")

# --- Resolve Drugs ---
for drug_name in parsed_criteria["inclusion_drugs"]:
    try:
        results = client.search.basic(
            drug_name,
            vocabulary_ids=["RxNorm"],
            domain_ids=["Drug"],
            page_size=3,
        )
        candidates = results.get("concepts", []) if results else []

        if candidates:
            # Find the ingredient-level concept
            ingredient = None
            for c in candidates:
                if c.get("concept_class_id") == "Ingredient":
                    ingredient = c
                    break
            if not ingredient:
                ingredient = candidates[0]

            drug_id = ingredient["concept_id"]
            print(f"  Drug: '{drug_name}'")
            print(f"    -> {ingredient['concept_name']} (ID: {drug_id}, Class: {ingredient.get('concept_class_id', 'N/A')})")

            # Expand: get all drug products containing this ingredient
            descendants = client.hierarchy.descendants(drug_id, max_levels=3)
            desc_list = descendants if isinstance(descendants, list) else descendants.get("concepts", [])

            resolved_criteria["drug_concept_ids"].add(drug_id)
            for desc in desc_list:
                resolved_criteria["drug_concept_ids"].add(desc["concept_id"])

            print(f"    -> Expanded to {len(resolved_criteria['drug_concept_ids'])} concepts (ingredient + products)")
        else:
            print(f"  Drug: '{drug_name}' -> No RxNorm match found")

    except omophub.APIError as e:
        print(f"  API error for '{drug_name}': {e.message}")

# --- Resolve Measurements ---
for meas_name in parsed_criteria["inclusion_measurements"]:
    try:
        results = client.search.basic(
            meas_name,
            vocabulary_ids=["LOINC"],
            domain_ids=["Measurement"],
            page_size=3,
        )
        candidates = results.get("concepts", []) if results else []

        if candidates:
            top_match = candidates[0]
            meas_id = top_match["concept_id"]
            print(f"  Measurement: '{meas_name}'")
            print(f"    -> {top_match['concept_name']} (ID: {meas_id}, LOINC: {top_match.get('concept_code', 'N/A')})")
            resolved_criteria["measurement_concept_ids"].add(meas_id)
        else:
            print(f"  Measurement: '{meas_name}' -> No LOINC match found")

    except omophub.APIError as e:
        print(f"  API error for '{meas_name}': {e.message}")

# --- Summary ---
print(f"\nResolved Criteria Summary:")
print(f"  Condition concept set: {len(resolved_criteria['condition_concept_ids'])} concepts")
print(f"  Drug concept set: {len(resolved_criteria['drug_concept_ids'])} concepts")
print(f"  Measurement concept set: {len(resolved_criteria['measurement_concept_ids'])} concepts")
print(f"\nNote: Numerical thresholds (e.g., HbA1c > 7.0%) and temporal logic")
print(f"(e.g., 'for at least 1 year') must be handled in the database query layer.")
The Key Insight: The concept set expansion is what makes this practical. Without it, you’d miss every patient coded with a specific subtype of Type 2 Diabetes. With it, a single concept ID becomes a comprehensive set that catches every patient coded with the concept or any of its subtypes. That can be the difference between finding 50 candidates and finding 500 - and OMOPHub makes it an API call instead of a local database query.
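To make the effect of expansion concrete without calling any API, here is a small self-contained sketch. It uses hand-written (ancestor, descendant) pairs in the shape of OMOP's concept_ancestor table, which stores every ancestor-descendant pair at all levels. Only 201826 and 4329847 come from this article; the 90000xx IDs and the patient rows are hypothetical.

```python
# Toy (ancestor_concept_id, descendant_concept_id) pairs.
ANCESTOR_PAIRS = [
    (201826, 9000001),   # Type 2 DM -> Type 2 DM with renal complications
    (201826, 9000002),   # Type 2 DM -> Type 2 DM with peripheral angiopathy
]

# person_id -> condition concept IDs recorded for that person (invented data)
PATIENT_CONDITIONS = {
    101: {201826},    # coded with the top-level concept
    102: {9000001},   # coded with a specific subtype
    103: {9000002},   # coded with another subtype
    104: {4329847},   # myocardial infarction only, no diabetes
}


def expand(concept_id, pairs):
    """Return the concept plus every descendant listed for it."""
    return {concept_id} | {desc for anc, desc in pairs if anc == concept_id}


def matching_patients(concept_set, patients):
    """Patients whose recorded conditions overlap the concept set."""
    return sorted(pid for pid, conds in patients.items() if conds & concept_set)


exact = matching_patients({201826}, PATIENT_CONDITIONS)
expanded = matching_patients(expand(201826, ANCESTOR_PAIRS), PATIENT_CONDITIONS)

print("Exact ID match:     ", exact)
print("Expanded concept set:", expanded)
```

Matching on the single ID finds one of the three diabetic patients; matching on the expanded set finds all three and still leaves the non-diabetic patient out.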

4. Use Case B: Patient Pre-screening Against Trial Criteria

Once you have expanded concept sets from Use Case A, screening a patient becomes straightforward set logic: does the patient’s OMOP profile overlap with the trial’s required concepts?

The Scenario: A patient is in the clinic. Their OMOP record has condition, drug, and measurement concept IDs. You need to check if they match the trial criteria.

Code Snippet: Pre-screening a Patient
Python
import omophub

client = omophub.OMOPHub()

# Patient's OMOP profile (from their EHR/OMOP database)
patient_profile = {
    "condition_ids": {201826, 4329847, 443727, 40484648},
    # Illustrative labels: 201826 = Type 2 DM, 4329847 = MI, 443727 = Essential HTN, 40484648 = CKD
    # (verify each ID against the vocabulary before relying on it)
    "drug_ids": {1503297},
    # 1503297 = Metformin (example concept ID)
}

# Trial criteria (resolved + expanded from Use Case A)
# In production, these would be the full expanded concept sets
trial_criteria = {
    "required_condition_ids": {201826},      # Type 2 DM (would be expanded set in production)
    "required_drug_ids": {1503297},          # Metformin
    "excluded_condition_ids": {432571},      # e.g., Liver disease
}

print("Patient Pre-screening:\n")

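# Check inclusion criteria. Each required set stands for one criterion's
# (expanded) concept set, so any overlap with the patient's record satisfies it;
# issubset() would demand every descendant concept at once.
conditions_met = bool(trial_criteria["required_condition_ids"] & patient_profile["condition_ids"])
drugs_met = bool(trial_criteria["required_drug_ids"] & patient_profile.get("drug_ids", set()))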

# Check exclusion conditions
has_exclusion = bool(trial_criteria["excluded_condition_ids"] & patient_profile["condition_ids"])

print(f"  Required conditions met: {conditions_met}")
print(f"  Required drugs met: {drugs_met}")
print(f"  Has exclusion condition: {has_exclusion}")

if conditions_met and drugs_met and not has_exclusion:
    print("\n  RESULT: Patient is a POTENTIAL CANDIDATE for this trial.")
    print("  (Pending measurement checks - HbA1c threshold, etc.)")
else:
    # Look up names of unmet criteria for clear reporting
    unmet = []
    if not conditions_met:
        missing = trial_criteria["required_condition_ids"] - patient_profile["condition_ids"]
        unmet.extend(list(missing))
    if not drugs_met:
        missing = trial_criteria["required_drug_ids"] - patient_profile.get("drug_ids", set())
        unmet.extend(list(missing))

    if unmet:
        try:
            details = client.concepts.batch(list(unmet))
            detail_list = details if isinstance(details, list) else details.get("concepts", [])
            print(f"\n  RESULT: Patient does NOT meet criteria.")
            print(f"  Missing:")
            for concept in detail_list:
                print(f"    - {concept.get('concept_name', 'Unknown')} (ID: {concept['concept_id']})")
        except omophub.APIError as e:
            print(f"\n  RESULT: Patient does NOT meet criteria. Missing IDs: {unmet}")

    if has_exclusion:
        excluded = trial_criteria["excluded_condition_ids"] & patient_profile["condition_ids"]
        try:
            details = client.concepts.batch(list(excluded))
            detail_list = details if isinstance(details, list) else details.get("concepts", [])
            print(f"  Exclusion criteria triggered:")
            for concept in detail_list:
                print(f"    - {concept.get('concept_name', 'Unknown')} (ID: {concept['concept_id']})")
        except omophub.APIError:
            print(f"  Exclusion criteria triggered: IDs {excluded}")
The Key Insight: The core matching is just set operations. OMOPHub’s value is in two places: (a) building the concept sets in the first place (Use Case A), and (b) providing human-readable concept names for the screening report - so the clinician sees “Missing: Metformin” instead of “Missing: concept ID 1503297.” That translation from IDs to names is small but critical for clinical adoption.
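When a trial has several inclusion and exclusion criteria, each with its own expanded concept set, the same set logic can be applied per criterion. The sketch below assumes a hypothetical `screen` helper and criterion names chosen for illustration; 201826, 1503297, and 4329847 are the IDs used in this article, and 9000001 is a placeholder descendant.

```python
from typing import Dict, List, Set


def screen(patient_ids: Set[int],
           inclusion: Dict[str, Set[int]],
           exclusion: Dict[str, Set[int]]) -> dict:
    """Check one patient against named concept sets.

    An inclusion criterion is met when the patient has at least one concept
    from its set; an exclusion criterion is triggered the same way.
    """
    unmet: List[str] = [name for name, cs in inclusion.items() if not (patient_ids & cs)]
    triggered: List[str] = [name for name, cs in exclusion.items() if patient_ids & cs]
    return {
        "eligible": not unmet and not triggered,
        "unmet_inclusion": unmet,
        "triggered_exclusion": triggered,
    }


# All condition and drug concept IDs for one patient, pooled into one set.
patient = {201826, 4329847, 1503297}

inclusion_sets = {
    "Type 2 Diabetes Mellitus": {201826, 9000001},   # expanded set (9000001 is hypothetical)
    "Metformin": {1503297},
}
exclusion_sets = {
    "Myocardial infarction": {4329847},
}

result = screen(patient, inclusion_sets, exclusion_sets)
print(result)
```

Keying each concept set by the criterion's name means the report can say which criterion failed without a lookup per concept ID; name lookups are then only needed when the report should show the specific concept found on the patient's record.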

5. The “Explainable Matching” Layer

A simple “eligible / not eligible” isn’t enough for clinicians. They need to understand why - especially for exclusions. Was the patient excluded because of a permanent contraindication, or something that could change (like a medication that could be washed out)? This is where pairing OMOPHub’s structured vocabulary data with an external LLM creates real clinical value.

Example Workflow:
  1. The screening step finds that the patient has concept ID 4329847 (Myocardial Infarction), which falls in one of the trial’s exclusion concept sets
  2. OMOPHub provides the structured metadata: concept name, domain, vocabulary
  3. You feed this to an LLM with the trial context to generate a clinician-facing explanation
Python
# After OMOPHub resolves the exclusion reason:
exclusion_data = {
    "concept_name": "Myocardial infarction",
    "concept_id": 4329847,
    "vocabulary": "SNOMED",
    "domain": "Condition",
    "criterion_type": "exclusion",
    "trial_criterion_text": "Patients with history of myocardial infarction within the past 6 months",
}

# Feed to an external LLM for a clinician-facing explanation:
prompt = f"""
A patient was flagged during clinical trial screening.

Exclusion reason: The patient has a recorded {exclusion_data['concept_name']}
(SNOMED concept {exclusion_data['concept_id']}), which triggers the following
trial exclusion criterion: "{exclusion_data['trial_criterion_text']}"

Provide a 2-sentence clinical explanation for the treating physician,
including whether this exclusion could become eligible with time.
"""

# llm_response = your_llm_client.complete(prompt)
# Expected: "This patient is currently excluded due to a history of
# myocardial infarction, which falls under the trial's cardiovascular
# safety exclusion. If the MI occurred more than 6 months ago, the
# patient may become eligible - verify the event date in the clinical record."
The Key Insight: OMOPHub provides the structured vocabulary backbone (concept names, IDs, relationships). The LLM provides the clinical reasoning and natural language generation. Together, they transform an opaque “excluded” flag into an actionable explanation that helps the clinician make a decision. This is how you build trust in automated screening systems.
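For a time-bound exclusion like the one in the example prompt, the date arithmetic is better done in code and handed to the LLM as a fact than left for the model to infer. The sketch below is one way to do that, under stated assumptions: the `exclusion_window_status` helper is hypothetical, "6 months" is approximated as 183 days, and the event date is invented; in an OMOP database it would come from the condition record.

```python
from datetime import date, timedelta


def exclusion_window_status(event_date: date, screening_date: date,
                            window_days: int = 183) -> dict:
    """Report whether a time-bound exclusion still applies and when it lapses.

    window_days=183 approximates "within the past 6 months" (an assumption;
    a protocol may define the window differently).
    """
    lapses_on = event_date + timedelta(days=window_days)
    return {
        "still_excluded": screening_date < lapses_on,
        "lapses_on": lapses_on,
    }


# Invented dates for illustration.
mi_date = date(2024, 1, 15)
screened_on = date(2024, 5, 1)

status = exclusion_window_status(mi_date, screened_on)

# A line that could be appended to the LLM prompt as structured context.
context_line = (
    f"Event date: {mi_date.isoformat()}. "
    f"Exclusion window ends: {status['lapses_on'].isoformat()}. "
    f"Currently excluded: {status['still_excluded']}."
)
print(context_line)
```

With the window already computed, the LLM only has to phrase the explanation, and the clinician can check the stated dates against the record.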

6. Conclusion: Accelerating the Path to “First Patient In”

Clinical trial recruitment is a pipeline problem. The patients are there. The data is there. What’s missing is the infrastructure to connect them efficiently.

OMOPHub addresses two critical bottlenecks in that pipeline: resolving clinical terms from trial protocols into standardized OMOP concept IDs, and expanding those concepts into comprehensive concept sets that catch patients regardless of coding specificity. The first turns “Metformin” into a queryable ID. The second turns “Type 2 Diabetes” into a set of 50+ concept IDs that catches every relevant patient. Combined with NLP for criteria parsing and an LLM for explainable matching, you get a screening pipeline that takes days instead of weeks - and produces results that clinicians can understand and act on.

Start with Use Case A: take one inclusion criterion from a trial on ClinicalTrials.gov, extract the clinical terms, and run them through OMOPHub’s search and hierarchy APIs. See how many descendant concepts you get. That expanded concept set is the difference between a trial that struggles to recruit and one that finds its patients.