1. Rule 1: Deduplicate Before You Map
This is the single most impactful optimization. A dataset with 10 million patient records might contain only 3,000 unique diagnosis codes. Map the 3,000, not the 10 million.
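As a minimal sketch (the record field name and the client call in the comment are illustrative, not the actual OMOPHub SDK), deduplication is a one-liner before any API work:

```python
def unique_codes(records, key="diagnosis_code"):
    """Collapse a list of patient records to the distinct codes that need mapping."""
    return sorted({r[key] for r in records if r.get(key)})

# Hypothetical usage: map only the distinct codes, then join back locally.
# mappings = client.map_concepts(codes=unique_codes(rows), target="SNOMED")

records = [{"diagnosis_code": "E11.9"},
           {"diagnosis_code": "I10"},
           {"diagnosis_code": "E11.9"}]
print(unique_codes(records))  # 3 rows collapse to 2 codes: ['E11.9', 'I10']
```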
2. Rule 2: Use Batch Endpoints
OMOPHub provides batch and bulk endpoints that process multiple items in a single HTTP request. Each batch request counts as one API call, regardless of how many items are in the batch. Available batch and bulk endpoints:

| Endpoint | Purpose |
|---|---|
| POST /v1/concepts/batch | Retrieve up to 100 concepts by ID |
| POST /v1/concepts/map/batch | Map up to 100 source codes or source concept IDs to a target vocabulary |
| POST /v1/concepts/hierarchy/batch | Batch ancestor / descendant lookups |
| POST /v1/concepts/relationships/batch | Batch relationship queries |
| POST /v1/search/bulk | Run multiple search queries in one request |
| POST /v1/search/semantic-bulk | Batch semantic search with embeddings |
| POST /v1/fhir/resolve/batch | FHIR Resolver batch: up to 100 codings per request |
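A sketch of chunked batch mapping using only the standard library. The base URL and the payload field names are assumptions, so check the API reference for the exact request schema; the chunking logic is the portable part:

```python
import json
from urllib.request import Request, urlopen


def chunked(items, size=100):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def map_batch(codes, api_key, vocabulary="ICD10CM", target="SNOMED"):
    """One call to the batch mapping endpoint (payload fields are assumptions)."""
    req = Request(
        "https://api.omophub.com/v1/concepts/map/batch",
        data=json.dumps({"source_codes": codes,
                         "source_vocabulary": vocabulary,
                         "target_vocabulary": target}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)

# 250 unique codes -> 3 API calls (100 + 100 + 50), not 250:
# for batch in chunked(codes, 100):
#     results = map_batch(batch, api_key)
```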
3. Rule 3: Cache What’s Stable
Vocabulary data changes only when OHDSI publishes a new ATHENA release (typically every 2 to 3 months). Concept metadata you look up today will return the same result tomorrow, next week, and next month until the next release. Design your caching accordingly:
- File cache (JSON, SQLite) for single-machine pipelines
- Redis cache with a TTL aligned to your vocabulary update cycle (60-90 days)
- Database table (source_to_concept_map) for team-wide shared caches. See the Collaborative Mapping guide.
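As a sketch of the file-cache option: a minimal SQLite-backed source_to_concept_map table (the schema here is illustrative, not the full OMOP source_to_concept_map definition):

```python
import sqlite3


class MappingCache:
    """Minimal SQLite-backed code-to-concept cache for single-machine pipelines."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS source_to_concept_map "
            "(source_code TEXT PRIMARY KEY, target_concept_id INTEGER)")

    def get(self, code):
        """Return the cached concept ID, or None on a cache miss."""
        row = self.db.execute(
            "SELECT target_concept_id FROM source_to_concept_map "
            "WHERE source_code = ?", (code,)).fetchone()
        return row[0] if row else None

    def put(self, code, concept_id):
        self.db.execute(
            "INSERT OR REPLACE INTO source_to_concept_map VALUES (?, ?)",
            (code, concept_id))
        self.db.commit()
```

SQLite gives you file-backed persistence with no extra dependency, which suits single-machine pipelines; for shared caches, the same interface can front Redis with a TTL aligned to your 60-90 day vocabulary update cycle.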
4. Rule 4: Use the Right Search Endpoint
Different search endpoints have different characteristics. Use the most specific one for your use case.

| Endpoint | Best for |
|---|---|
| GET /v1/concepts/{concept_id} | Direct lookup when you already have the OMOP concept ID |
| GET /v1/concepts/by-code | Lookup by vocabulary code (e.g. ICD-10 E11.9) when you know the vocabulary |
| GET /v1/search/concepts | Keyword / full-text search with filters |
| GET /v1/search/autocomplete | Prefix matching for search-as-you-type UIs |
| GET /v1/concepts/semantic-search | Natural-language or fuzzy matching via embeddings |
When you already know the vocabulary and code, use by-code instead of text search. Save semantic search for when the user query is ambiguous or phrased in natural language - it handles synonyms, abbreviations, and clinical descriptions, but is meaningfully slower than keyword search.
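As a sketch of the by-code lookup (the base URL and query parameter names are assumptions; consult the API reference for the exact contract):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://api.omophub.com"  # base URL is an assumption


def by_code_url(code, vocabulary):
    """Build the by-code lookup URL (query parameter names are assumptions)."""
    return f"{BASE}/v1/concepts/by-code?" + urlencode(
        {"code": code, "vocabulary": vocabulary})


def get_concept_by_code(code, vocabulary, api_key):
    """Resolve one vocabulary code directly -- no text search involved."""
    req = Request(by_code_url(code, vocabulary),
                  headers={"Authorization": f"Bearer {api_key}"})
    with urlopen(req) as resp:
        return json.load(resp)

# e.g. get_concept_by_code("E11.9", "ICD10CM", api_key)
```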
5. Rule 5: Build Autocomplete Responsibly
If you’re powering a search-as-you-type UI with OMOPHub:

- Debounce aggressively. Don’t fire an API call on every keystroke. Wait 300-500ms after the user stops typing before sending the request.
- Use the autocomplete endpoint. GET /v1/search/autocomplete is optimized for prefix matching and returns faster than full search.
- Set a minimum query length. Don’t search for single characters. Require at least 3 characters before triggering a search.
- Cache recent results client-side. If the user types “diab”, gets results, then types “diabe”, the “diab” result set is a superset of the “diabe” matches and can be filtered locally without another request.
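The minimum-length and client-side-cache rules can be sketched in Python (the shape of the cached result objects is an assumption; the same logic ports directly to a browser client):

```python
MIN_QUERY_LEN = 3


def refine_locally(cache, query):
    """Reuse a cached result set when the new query extends a cached prefix.

    Returns [] below the minimum length, a locally filtered list on a
    prefix hit, or None when the caller should hit /v1/search/autocomplete.
    """
    if len(query) < MIN_QUERY_LEN:
        return []  # too short: never send a request
    for prefix, results in cache.items():
        if query.startswith(prefix):
            # "diabe" extends "diab": filter the cached superset locally
            return [r for r in results
                    if r["name"].lower().startswith(query.lower())]
    return None  # cache miss: one real autocomplete request needed
```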
6. Rule 6: Limit Hierarchy Depth
Concept hierarchies can be deep. SNOMED “Is a” trees sometimes traverse 20+ levels for highly specialized terms. For most phenotype definitions, 3 to 5 levels is plenty.
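A sketch of a depth-capped hierarchy call against the batch hierarchy endpoint from Rule 2. The payload field names (including max_levels) are assumptions, and the levels-of-separation field follows OMOP’s concept_ancestor naming, which the API response may not match exactly:

```python
import json
from urllib.request import Request, urlopen


def get_descendants_batch(concept_ids, api_key, max_levels=3):
    """Batch descendant lookup, capped at a clinically useful depth."""
    req = Request(
        "https://api.omophub.com/v1/concepts/hierarchy/batch",
        data=json.dumps({"concept_ids": concept_ids,
                         "relationship": "descendants",
                         "max_levels": max_levels}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)


def within_depth(descendants, max_levels):
    """Defensive post-filter in case the server returns deeper levels."""
    return [d for d in descendants
            if d["min_levels_of_separation"] <= max_levels]
```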
max_levels matters both for latency and for the number of concepts you pay to pull back. Cap it at the clinical depth you actually need.
7. Rule 7: Handle Errors and Retries
OMOPHub returns standard HTTP status codes. Build retry logic for transient failures only:

| Status | Meaning | Action |
|---|---|---|
| 200 | Success | Process response |
| 400 | Bad request | Fix your request, do not retry |
| 401 | Unauthorized | Check your API key |
| 404 | Not found | Concept or code does not exist, do not retry |
| 429 | Rate limited | Back off and retry after the Retry-After header |
| 5xx | Server error | Retry with exponential backoff |
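A minimal sketch of this retry policy. The transport is abstracted behind a `send` callable so the policy stays testable; the backoff parameters are illustrative, and a production version should also honor the Retry-After header on 429 responses:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # transient failures only


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(send, max_attempts=5, base=1.0):
    """`send` performs one request and returns (status_code, body).

    200 returns the body; 400/401/404 fail fast; 429/5xx back off and retry.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        time.sleep(backoff_delay(attempt, base))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```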
8. Putting It All Together: ETL Pipeline Pattern
Here’s the recommended shape for a production ETL pipeline:

1. Extract unique source codes. Pull a distinct list of codes from your source data. This is your mapping input, not the full patient dataset.
2. Check your local cache. Before hitting the API, check if you’ve already mapped each code in a previous run. Load your source_to_concept_map cache.
3. Batch-resolve the cache misses. For codes not in the cache, use the batch mapping endpoint (POST /v1/concepts/map/batch) or the FHIR Resolver batch (POST /v1/fhir/resolve/batch). Chunk into groups of 100.
4. Apply mappings to the full dataset. Join the mapping cache against your source data via local lookup (pandas merge, SQL JOIN, dict lookup). No API calls needed for this step.
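The whole pattern can be sketched as one function. Here `resolve_batch` stands in for the batch mapping call (its name and its code-to-concept-ID return shape are assumptions), and the cache is a plain dict for brevity:

```python
def run_mapping_pipeline(rows, cache, resolve_batch, key="code", batch=100):
    """ETL sketch: dedupe -> cache check -> batch-resolve misses -> local join.

    rows:          source records, each with a `key` field holding the code
    cache:         dict of code -> concept_id, persisted between runs
    resolve_batch: callable wrapping POST /v1/concepts/map/batch (not shown)
    """
    # 1. Extract unique source codes -- the mapping input, not the dataset.
    codes = sorted({r[key] for r in rows})
    # 2. Check the local cache; only misses go to the API.
    misses = [c for c in codes if c not in cache]
    # 3. Batch-resolve misses in chunks of 100 (one API call per chunk).
    for i in range(0, len(misses), batch):
        cache.update(resolve_batch(misses[i:i + batch]))
    # 4. Apply mappings with a local dict lookup -- zero API calls here.
    return [{**r, "concept_id": cache.get(r[key])} for r in rows]
```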
For more on the full end-to-end pattern, see Lean ETL Mapping Cache and Collaborative Mapping.