Data Sources
NPPES NPI Registry
The primary data source is the NPPES (National Plan and Provider Enumeration System) NPI Registry, published by the Centers for Medicare & Medicaid Services (CMS). Every HIPAA-covered U.S. healthcare provider, including any provider that bills Medicare or Medicaid, is required to obtain a National Provider Identifier (NPI): a unique 10-digit number that serves as the provider's permanent identifier.
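As an illustration of the NPI format, the tenth digit is a check digit computed with the Luhn algorithm over the number prefixed with 80840. The sketch below shows one way to sanity-check an NPI string before looking it up. The function name `is_valid_npi` is our own for this example and is not part of any CMS tooling.

```python
def is_valid_npi(npi: str) -> bool:
    """Return True if `npi` is 10 ASCII digits and passes the Luhn check.

    The check is run over the NPI prefixed with "80840", the health
    industry prefix used in the NPI check-digit calculation.
    """
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False

    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A passing check only means the number is well formed. It does not show that the NPI has actually been issued to a provider.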
CMS publishes the full registry as public domain bulk data files:
- npidata_pfile — Main file: ~8 million records, one per NPI. Contains name, entity type, taxonomy codes, practice address, enumeration date, and more.
- endpoint_pfile — Digital contact endpoints (Direct addresses, FHIR endpoints) for providers.
- othername_pfile — Other/former names for provider organizations.
- pl_pfile — Secondary practice locations beyond the primary address.
DoctorDataHub uses the main npidata_pfile as its primary source. The full file is approximately 10 GB uncompressed and contains ~300 columns per record.
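Because the file is this large, it is usually read row by row rather than loaded into memory at once. The following is a minimal sketch of that approach, not the actual DoctorDataHub code. The column names in `WANTED_COLUMNS` are assumptions and should be verified against the header row of the file you download.

```python
import csv
from typing import Iterable, Iterator

# Assumed header names; check them against the real header row.
WANTED_COLUMNS = ("NPI", "Entity Type Code")


def iter_records(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield one small dict per CSV row, keeping only WANTED_COLUMNS.

    `lines` can be an open text file or any iterable of CSV lines, so
    only one row is held in memory at a time.
    """
    reader = csv.DictReader(lines)
    for row in reader:
        # Missing or short rows yield an empty string for that column.
        yield {name: (row.get(name) or "") for name in WANTED_COLUMNS}
```

To use it on the real file, open the CSV in text mode with `newline=""` and pass the file handle to `iter_records`. Any iterable of lines works, which also makes the function easy to test with a short in-memory sample.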
NUCC Taxonomy Codes
Healthcare provider specialties are encoded using NUCC (National Uniform Claim Committee) Health Care Provider Taxonomy Codes. Each provider can have up to 15 taxonomy codes in the NPPES data, one of which is flagged as primary.
We map taxonomy codes to human-readable specialty labels (e.g., 207Q00000X → "Family Medicine") using the standard NUCC taxonomy reference.
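The mapping itself can be sketched as a simple dictionary lookup. The two entries below are a small excerpt for illustration only; a real table would be loaded from the NUCC reference file. The names `TAXONOMY_LABELS` and `specialty_label` are hypothetical.

```python
# Illustrative excerpt; the full NUCC code set contains many more entries.
TAXONOMY_LABELS = {
    "207Q00000X": "Family Medicine",
    "207R00000X": "Internal Medicine",
}


def specialty_label(code: str, default: str = "Unknown") -> str:
    """Return the human-readable label for a taxonomy code.

    Codes are trimmed and upper-cased before lookup; codes missing
    from the table fall back to `default`.
    """
    return TAXONOMY_LABELS.get(code.strip().upper(), default)
```

Falling back to a default label rather than raising an error keeps a single unrecognized code from stopping an import.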
Update Schedule
CMS publishes weekly incremental updates and monthly full replacement files for the NPPES registry. Our ETL pipeline processes the monthly full file to rebuild the database with the latest data. The "Data last updated" timestamp in the site footer reflects when our most recent import completed.
Data Processing
We process the raw NPPES CSV using a Python ETL pipeline that:
- Parses each row and extracts the primary taxonomy code (where Primary Taxonomy Switch = 'Y')
- Normalizes provider names and organizations
- Builds a PostgreSQL tsvector search index over name and location fields
- Upserts records by NPI, so existing records are updated in place
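The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline. The NPPES column names should be verified against the file header, the title-casing rule in `normalize_name` is just one possible normalization, and the `providers` table, its columns, and the psycopg-style `%(name)s` placeholders are hypothetical.

```python
from typing import Optional

MAX_TAXONOMY_SLOTS = 15  # NPPES allows up to 15 taxonomy codes per provider


def primary_taxonomy(row: dict[str, str]) -> Optional[str]:
    """Return the taxonomy code whose primary switch is 'Y', or None."""
    for slot in range(1, MAX_TAXONOMY_SLOTS + 1):
        # Assumed column names; confirm against the file's header row.
        switch = row.get(f"Healthcare Provider Primary Taxonomy Switch_{slot}") or ""
        code = row.get(f"Healthcare Provider Taxonomy Code_{slot}") or ""
        if switch.strip().upper() == "Y" and code.strip():
            return code.strip()
    return None


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace and title-case the result (assumed rule)."""
    return " ".join(raw.split()).title()


# Hypothetical table and columns. ON CONFLICT updates the existing row for
# an NPI instead of inserting a duplicate, and the tsvector is rebuilt
# from the name and location fields on every upsert.
UPSERT_SQL = """
INSERT INTO providers (npi, name, city, state, taxonomy_code, search_vector)
VALUES (
    %(npi)s, %(name)s, %(city)s, %(state)s, %(taxonomy_code)s,
    to_tsvector('simple',
        coalesce(%(name)s, '') || ' ' ||
        coalesce(%(city)s, '') || ' ' ||
        coalesce(%(state)s, ''))
)
ON CONFLICT (npi) DO UPDATE SET
    name = EXCLUDED.name,
    city = EXCLUDED.city,
    state = EXCLUDED.state,
    taxonomy_code = EXCLUDED.taxonomy_code,
    search_vector = EXCLUDED.search_vector;
"""
```

The SQL string would be executed once per record, or in batches, through a PostgreSQL driver. The upsert relies on `npi` having a unique or primary-key constraint, since `ON CONFLICT (npi)` needs one to detect the existing row.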
Apart from the name normalization and specialty labels described above, no data is modified or enriched beyond what CMS publishes. We do not add, infer, or correct any provider information.