Every OMOP team faces this question: should we download the vocabulary files from ATHENA and run our own vocabulary database, or use an API? This page gives you an honest comparison so you can decide which fits your situation.

1. The Self-Hosting Path

The traditional approach:
  1. Go to athena.ohdsi.org and request a vocabulary download
  2. Wait for the download link (can take hours to days)
  3. Download 3–5 GB of CSV files
  4. Set up a PostgreSQL database with the OMOP vocabulary schema
  5. Load the CSVs (typically 30–60 minutes depending on hardware)
  6. Build indexes for acceptable query performance (another 30–60 minutes)
  7. Write SQL queries or build a service layer on top
  8. Repeat steps 1–6 every time OHDSI publishes a new release
This works. Thousands of OHDSI sites run this way. But it has real costs.
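The load in steps 4–6 usually boils down to a sequence of psql `\copy` commands, one per vocabulary table. Here is a minimal Python sketch that generates them. The table names are the standard OMOP v5 vocabulary set; the CSV directory, delimiter, and quoting options are assumptions you should adjust to your actual download:

```python
# Sketch of steps 4-6: generate psql \copy commands for a fresh
# vocabulary schema. ATHENA "CSV" files are tab-delimited; the
# QUOTE E'\b' trick is one common recipe for loading unquoted files.
# Adjust paths and options to your environment.
VOCAB_TABLES = [
    "DOMAIN", "VOCABULARY", "CONCEPT_CLASS", "RELATIONSHIP",
    "CONCEPT", "CONCEPT_SYNONYM", "CONCEPT_RELATIONSHIP",
    "CONCEPT_ANCESTOR", "DRUG_STRENGTH",
]

def copy_statement(table: str, csv_dir: str = "/data/athena") -> str:
    """Build the psql \\copy command for one ATHENA file."""
    return (
        f"\\copy {table.lower()} FROM '{csv_dir}/{table}.csv' "
        "WITH (FORMAT csv, DELIMITER E'\\t', HEADER, QUOTE E'\\b')"
    )

if __name__ == "__main__":
    for t in VOCAB_TABLES:
        print(copy_statement(t))
```

Load order matters in practice: tables with foreign keys (CONCEPT_RELATIONSHIP, CONCEPT_ANCESTOR) go after CONCEPT, and indexes are typically built after the load, not before.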

2. Where Self-Hosting Gets Expensive

Setup time is not zero. A senior data engineer typically spends 1–2 days on initial setup, including schema creation, CSV loading, index tuning, and basic query testing. For teams new to OMOP, it can take a week.

Maintenance is ongoing. ATHENA publishes vocabulary updates every 6 months. Each update means re-downloading, re-loading, re-indexing, and regression testing. Teams that skip updates end up with stale vocabularies - deprecated concepts, missing new codes, broken mappings. See Vocabulary Lifecycle Management for the pattern to stay current.

No search out of the box. ATHENA CSVs give you tables, not a search engine. Building fuzzy search, autocomplete, or semantic similarity requires additional tooling - Elasticsearch, custom indexing, neural embedding models. Most teams never build this, so they're stuck with exact-match SQL queries.

No API without building one. If your ETL scripts, FHIR server, LLM pipeline, or frontend application need vocabulary access, you have to build and maintain a REST API on top of your database. That's a web framework, auth, rate limiting, caching, monitoring, and deployment - for every team, from scratch.

Scales with your team, not your problem. Every new developer, every new project, every new environment needs access to the vocabulary database. That means either shared database access (operationally risky) or multiple copies (expensive and prone to version drift).
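To make the exact-match limitation concrete, here is a toy version of the lookup SQL teams end up writing, using an in-memory SQLite stand-in for the Postgres concept table, seeded with one real SNOMED concept:

```python
# Illustration of the "exact-match SQL" gap: a raw vocabulary table
# answers exact code lookups, but a free-text or misspelled query
# finds nothing without extra search tooling.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INT, concept_name TEXT, "
    "vocabulary_id TEXT, concept_code TEXT)"
)
# One real row: SNOMED 44054006 = Type 2 diabetes mellitus
conn.execute(
    "INSERT INTO concept VALUES (201826, 'Type 2 diabetes mellitus', "
    "'SNOMED', '44054006')"
)

def lookup(code: str, vocab: str):
    """Exact-match lookup by source code - all that plain SQL gives you."""
    return conn.execute(
        "SELECT concept_id, concept_name FROM concept "
        "WHERE concept_code = ? AND vocabulary_id = ?",
        (code, vocab),
    ).fetchone()

print(lookup("44054006", "SNOMED"))  # exact code: found
# A clinician's free-text string like "type II diabetes" matches
# nothing here - that is the gap a search engine has to fill.
```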

3. What OMOPHub Gives You Instead

| Capability | Self-hosted ATHENA | OMOPHub |
| --- | --- | --- |
| Setup time | 1–2 days | 5 minutes (get an API key) |
| Vocabulary updates | Manual re-download and re-load | Automatic, synced with ATHENA releases |
| Full-text search | Build your own | Built-in |
| Semantic search | Build your own (need an embedding model) | Built-in (neural embeddings) |
| Autocomplete | Build your own | Built-in |
| REST API | Build your own | Built-in |
| Python SDK | Build your own | pip install omophub |
| R SDK | Build your own | install.packages("omophub") |
| MCP Server for AI agents | Build your own | npx -y @omophub/omophub-mcp |
| FHIR Terminology Service | Build your own or deploy Echidna/Snowstorm | Built-in ($lookup, $translate, $validate-code, $expand, $subsumes, $find-matches, $closure, $diff) |
| FHIR Concept Resolver (Coding → OMOP + CDM table) | Not a standard OHDSI tool; build your own | Built-in (POST /v1/fhir/resolve) |
| Batch operations | Hand-written SQL | Built-in batch endpoints - see Batch & Performance |
| Phoebe recommendations | Requires separate setup | Built-in via property=recommended on $lookup |
| Infrastructure cost | $150–400/month (database + compute) | Free tier available; paid tiers for higher volume |
| Maintenance burden | Ongoing | Zero |
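As a sketch of what the API side of the table looks like in code, the helpers below build a concept-search request. The base URL, endpoint path, parameter names, and auth header are illustrative assumptions, not taken from the OMOPHub API reference - check the real documentation before relying on them:

```python
# Hypothetical shape of a hosted-API concept search. Everything about
# the URL and header below is an assumption for illustration only.
import os
import urllib.parse

API_BASE = "https://api.omophub.com/v1"  # assumed base URL

def search_url(query: str, vocabulary: str = "") -> str:
    """Build a (hypothetical) concept-search URL for a query string."""
    params = {"query": query}
    if vocabulary:
        params["vocabulary_id"] = vocabulary
    return f"{API_BASE}/concepts/search?{urllib.parse.urlencode(params)}"

def auth_headers() -> dict:
    """API-key auth from an environment variable (assumed header name)."""
    return {"Authorization": f"Bearer {os.environ.get('OMOPHUB_API_KEY', '')}"}

# Usage (requires a real key and network access):
#   requests.get(search_url("type 2 diabetes", "SNOMED"), headers=auth_headers())
```

The point of the comparison is that this is the whole client: no schema, no load job, no index build.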

4. When Self-Hosting Still Makes Sense

OMOPHub is not the right choice for every situation:
  • Air-gapped environments where no external API calls are permitted. Even here, the Lean ETL Mapping Cache guide shows a hybrid approach - use OMOPHub during development, cache the results, and deploy locally.
  • Custom vocabulary extensions where you’ve added proprietary concepts to your local OMOP vocabulary tables. OMOPHub serves standard ATHENA content only.
  • Extremely high volume workloads that exceed API rate limits and where latency requirements demand sub-millisecond local lookups. For most ETL workloads, the batch endpoints and caching strategies in Batch & Performance handle this comfortably.
  • Regulatory requirements that explicitly prohibit sending vocabulary queries to an external service, even when no PHI is involved. See Security & Data Handling for what actually flows through the API - spoiler: vocabulary codes, not patient data.

5. The Hybrid Approach

Many teams use both: OMOPHub for development, exploration, and ETL building, with a local vocabulary cache for production execution. The Lean ETL Mapping Cache guide walks through this pattern in detail. This gives you the best of both worlds: fast iteration with OMOPHub's search and mapping capabilities during development, and zero external dependencies in production.
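A minimal sketch of that cache pattern, assuming nothing about the OMOPHub SDK - the `resolver` callable stands in for whatever API or SDK call fills the cache during development, and the file format here is a plain JSON dict:

```python
# Hybrid pattern sketch: resolve source-code -> concept_id mappings via
# an API during development, persist them to a local file, and serve
# production lookups from that file with no external calls.
import json
from pathlib import Path
from typing import Callable, Optional

class MappingCache:
    def __init__(self, path: Path, resolver: Optional[Callable[[str], int]] = None):
        self.path = path
        self.resolver = resolver  # None in production: cache-only mode
        self.cache = json.loads(path.read_text()) if path.exists() else {}

    def concept_id(self, source_code: str) -> int:
        if source_code not in self.cache:
            if self.resolver is None:
                raise KeyError(f"{source_code} not in cache (production mode)")
            self.cache[source_code] = self.resolver(source_code)
            self.path.write_text(json.dumps(self.cache))
        return self.cache[source_code]

# Development: fill the cache. A fake resolver is used here for
# illustration; in real use this would be the API call.
dev = MappingCache(Path("mappings.json"), resolver=lambda code: 201826)
dev.concept_id("ICD10CM:E11.9")

# Production: same file, no resolver, no network.
prod = MappingCache(Path("mappings.json"))
print(prod.concept_id("ICD10CM:E11.9"))  # served from the local cache
```

In a real pipeline the cache would be keyed by vocabulary release as well, so a vocabulary update invalidates stale mappings rather than silently serving them.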