Publicly Available Health & Health Care Data

Last updated on 2023-03-13

Have you ever needed a benchmark to evaluate your program against, or needed to know how many people in a state had a certain condition for opportunity sizing, but not known where to start? Wondered whether your company’s performance was above or below the national average? Or wanted an easy way to find out how big the market is for a point solution in a given geographic area? There are publicly available health & health care data, and online tools for analyzing them, that could be useful for businesses as they evaluate and grow their product lines. That is, if you know where to look.

While the talk of “big data” has become a cliche, one truth persists: the vastness of data sources can be daunting, and it can be hard to know where to start. And once you know, how can you get relatively straightforward insights from those data without the hassle of working with the raw data? Surely there is some solution for this by now! And there is…sort of. Some online analysis tools are quite easy to use and helpful; others are rather lacking; and some federal survey data still have no analysis tool at all, so the only option is to download the data files and use statistical software to extract findings.

I recently hosted a workshop on publicly available sources for health data, which forced me to create some frameworks and assets to guide people toward relevant data sources for their use case. I had worked with many of these data sources in my own research or teaching courses that involved an original public health data analysis (like a class I referred to as “get your thesis done”). Most academic researchers know these data well and have research questions that require working with the raw data.

This session, instead, was oriented to what people in product, operations, and strategy roles at healthcare and health tech companies need: getting an answer quickly, often broken down by 1 or 2 factors but without a more complex statistical analysis; prioritizing practical and business impact over capital-“S” scientific impact; and ideally doing it themselves, without a data team or statistical software. Since they aren’t regularly working with data sources like these, they may benefit even more from guides on where and how to start.

This post introduces some of the workshop’s assets — namely, 1. a framework to identify relevant data sources and 2. a catalog of some of those datasets’ features — and explains how they might be useful to someone wanting to get a quick answer with relatively low lift.

How to find relevant data?

Most data that are relevant for health tech & healthcare startups come from organizations within the US Department of Health and Human Services (HHS). There are some ancillary data from other agencies that could be helpful for more general demographic and economic questions, but for the most part the HHS agencies are likely going to be the best bet, and some agencies more than others.

Across those agencies, there are a few common data sources that can be organized by general topic or type of data. In the best case, everyone has a data colleague to ping with an actual quick question about where to start (so few QQs to a data team are actually quick). I made this workflow diagram / framework as a static substitute for that ideal, but unlikely, scenario of having someone to ask.

Figure 1: What kind of data do you need?

This version will open the figure in a new window so you can zoom in. Essentially the question points are:

Who do you want data on: patients/people or providers/institutions?
What level of geographic specificity do you need: state-level or national?
What population: adults or children?
- Of children, teens or younger?
What topic or type of data do you need: direct physiological markers, care utilization and costs, or patient reported outcomes?
- Of patient reported outcomes, what focus area: patient experience, mental health & substance use, or physical health?
  - Of physical health, what focus area: general health status or fertility & reproductive health?
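As a rough sketch, the question sequence above can be written down as a small nested structure and walked in order. The `Question` class, the `QUESTIONS` list, and the `walk` function are hypothetical names introduced only for illustration, and the shortened follow-up prompts are paraphrases. The sketch deliberately stops at the last question: which survey sits at the end of each path is shown in Figure 1 and is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Question:
    """One decision point: a prompt, its allowed answers, and any follow-ups."""
    prompt: str
    options: Tuple[str, ...]
    # Follow-up questions, keyed by the answer that triggers them.
    follow_ups: Dict[str, "Question"] = field(default_factory=dict)


# The question points from the list above, in the order they appear.
QUESTIONS = [
    Question("Who do you want data on?",
             ("patients/people", "providers/institutions")),
    Question("What level of geographic specificity do you need?",
             ("state-level", "national")),
    Question("What population?",
             ("adults", "children"),
             {"children": Question("Of children, which group?",
                                   ("teens", "younger"))}),
    Question("What topic or type of data do you need?",
             ("direct physiological markers",
              "care utilization and costs",
              "patient reported outcomes"),
             {"patient reported outcomes": Question(
                 "Of patient reported outcomes, what focus area?",
                 ("patient experience",
                  "mental health & substance use",
                  "physical health"),
                 {"physical health": Question(
                     "Of physical health, what focus area?",
                     ("general health status",
                      "fertility & reproductive health"))})}),
]


def walk(answers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (prompt, answer) pairs in the order the questions are asked."""
    path: List[Tuple[str, str]] = []

    def ask(question: Question) -> None:
        answer = answers.get(question.prompt)
        if answer not in question.options:
            raise ValueError(
                f"{question.prompt!r} needs one of {question.options}")
        path.append((question.prompt, answer))
        follow_up = question.follow_ups.get(answer)
        if follow_up is not None:
            ask(follow_up)

    for question in QUESTIONS:
        ask(question)
    return path


# Hypothetical example: state-level data on adults' general health status.
example = walk({
    "Who do you want data on?": "patients/people",
    "What level of geographic specificity do you need?": "state-level",
    "What population?": "adults",
    "What topic or type of data do you need?": "patient reported outcomes",
    "Of patient reported outcomes, what focus area?": "physical health",
    "Of physical health, what focus area?": "general health status",
})
for prompt, answer in example:
    print(f"{prompt} -> {answer}")
```

Writing the answers down this way, before opening any tool, makes it easier to see which branch of the diagram applies.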

What are the data? And can you analyze it online?

The decision tree focuses on the general topic of the data, but there are many other dimensions of each source that may be important to consider. What’s the sample size? How often are data collected? Is the study design longitudinal, with multiple surveys of the same person over time, or a cross-sectional snapshot of a different sample of people every time?

I made a catalog / table, shown in the figure below, with all of this information for each study. If you already know what data you need, or want to choose based on one of these factors (e.g., sample size), this would be a good place to start.

Figure 2: What’s in the data?

And with this specific user – again, someone in product, ops, or strategy who needs a quick answer without the hassle of raw data – in mind, is there an online analysis tool? Which places can you get a good (enough) answer (relatively) quickly and easily? Many, though not all, surveys have some such tool.

The surveys with visualization tools and query functions are:

Behavioral Risk Factor Surveillance System (BRFSS)
National Health and Nutrition Examination Survey (NHANES)
Medical Expenditure Panel Survey (MEPS): Household or Individual Data
Consumer Assessment of Healthcare Providers and Systems (CAHPS)
Healthcare Cost and Utilization Project (HCUP)
National Survey on Drug Use and Health (NSDUH)
National Health Interview Survey (NHIS)
National Survey of Children’s Health (NSCH)
Youth Risk Behavior Surveillance System (YRBSS)

What can you get from these online analysis tools?

As a glimpse of what these analysis tools look like and which features are common, BRFSS is a good example. It has the largest sample size and the best capacity for state or metro-area analysis of any of the studies listed. The Prevalence and Trends Data Tool allows users to select one of the survey questions and see the state average responses displayed in a map, graph, or table. There are some additional features for more detailed analyses:

Age Adjustment: some states’ residents are older/younger than others. This tool will account for age in the prevalence estimate, which makes for fairer comparisons between states.
Break down average by sociodemographic factors: you can go one level deeper and get estimates by one of a few demographic factors like age, gender, race/ethnicity or socioeconomic factors like education and household income.
Look at specific answers: Some questions have multiple answer options that capture different nuances, like the diabetes question, which records gestational diabetes and pre-diabetes separately from a straight yes/no. Even if a question is just yes/no, you may be more interested in one of those options than the other. This lets you pick which answer option you’re interested in.
Metro Area analyses: You can explore the data one level below states, at the Metropolitan and Micropolitan Statistical Areas (MMSAs), which are administrative groupings of urban and suburban population centers.
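To make the age adjustment feature above more concrete, here is a minimal sketch of direct age standardization, one common way to do this kind of adjustment. Whether the BRFSS tool uses exactly this weighting is an assumption, and every number below (age groups, rates, populations) is made up for illustration, not taken from any survey.

```python
from typing import Dict


def crude_rate(rates: Dict[str, float], population: Dict[str, float]) -> float:
    """Prevalence weighted by the state's own age distribution."""
    total = sum(population.values())
    return sum(rates[group] * population[group] for group in rates) / total


def age_adjusted_rate(rates: Dict[str, float],
                      standard_population: Dict[str, float]) -> float:
    """Prevalence weighted by a shared standard age distribution."""
    total = sum(standard_population.values())
    return sum(rates[group] * standard_population[group]
               for group in rates) / total


# Hypothetical age-specific prevalence, identical in both states.
rates = {"18-44": 0.04, "45-64": 0.12, "65+": 0.20}

# Hypothetical populations (thousands): state B skews older than state A.
state_a = {"18-44": 500, "45-64": 300, "65+": 200}
state_b = {"18-44": 300, "45-64": 300, "65+": 400}

# Hypothetical standard population applied to both states.
standard = {"18-44": 400, "45-64": 350, "65+": 250}

crude_a = crude_rate(rates, state_a)
crude_b = crude_rate(rates, state_b)
adjusted_a = age_adjusted_rate(rates, standard)
adjusted_b = age_adjusted_rate(rates, standard)

print(f"Crude:    A {crude_a:.1%}, B {crude_b:.1%}")
print(f"Adjusted: A {adjusted_a:.1%}, B {adjusted_b:.1%}")
```

In this toy example the crude rates differ only because state B is older. Once both states are weighted by the same standard population the gap disappears, which is the kind of comparison the age adjustment option is meant to support.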

Figure 3: What’s in the data?

Some limitations and strengths

If you are looking to get a quick and dirty number of people with condition X in one of the market areas or how common Y is to compare against, these tools likely get the job done. Most other common business questions would need the raw data and better analysis tools.
Not all online analysis tools are very intuitive to use. For example, this more sophisticated analysis tool with BRFSS data gives substantially more features and flexibility, provided you know what you’re trying to make. I suspect users who don’t spend a lot of time in data dashboarding GUIs may find this a challenging way to get to a final result.
There isn’t a consistent tool used across surveys, even those administered by the same agency. AHRQ uses tools that look effectively like white-labelled Tableau dashboards, with a good balance of visual and analytic features and a simple user interface. CDC surveys, by contrast, look quite different from each other: NHIS has a similar set of features to BRFSS, but time spent learning one tool does not carry over to another CDC survey’s tool.
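For the "quick and dirty number" use case in the first point above, the arithmetic after reading a prevalence estimate off one of these tools is short. The sketch below assumes the tool reports a prevalence and a confidence interval, and that you have an adult population count from another source. The function names and all figures are hypothetical.

```python
from typing import Tuple


def estimate_count(prevalence: float, population: int) -> int:
    """Estimated number of people with the condition."""
    if not 0 <= prevalence <= 1:
        raise ValueError("prevalence must be a proportion between 0 and 1")
    return round(prevalence * population)


def estimate_range(ci_low: float, ci_high: float,
                   population: int) -> Tuple[int, int]:
    """Low and high counts implied by the reported confidence interval."""
    return (estimate_count(ci_low, population),
            estimate_count(ci_high, population))


def compare_to_benchmark(own_rate: float, ci_low: float,
                         ci_high: float) -> str:
    """Place your own rate relative to the benchmark's confidence interval."""
    if own_rate < ci_low:
        return "below"
    if own_rate > ci_high:
        return "above"
    return "within"


# Hypothetical inputs: 11% prevalence (CI 10% to 12%), 4 million adults.
adults = 4_000_000
point = estimate_count(0.11, adults)
low, high = estimate_range(0.10, 0.12, adults)
print(f"Estimated market: {point:,} people (range {low:,} to {high:,})")

# Hypothetical benchmark check: your program's rate is 15%.
print(compare_to_benchmark(0.15, 0.10, 0.12))
```

The benchmark comparison is deliberately crude: it ignores the uncertainty in your own rate, so treat "above" or "below" as a prompt for a closer look rather than a statistical test.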

What else, what’s missing, what’s next?

This pass focused on survey data: largely federal, national-level surveys that were likely to have online tools. What’s not included?

Claims: I didn’t focus on claims from CMS or states. I find these data are generally pay-walled or are too cumbersome to use simply. You would need a robust data analyst’s tools at hand (including time). MEPS and HCUP are the exceptions here; otherwise, you’d probably be better off working with the raw data. I may dig in here a little further since claims are something most ops and product folks are used to working with, and they are better for business questions than some of the patient reported outcomes.
Other government agency data: Ancillary data like Census data, Department of Labor data on workforce issues, or more directly health-related data from other agencies (e.g., NHTSA has an EMS data source, NEMSIS) may be useful for specific use cases.

This post also framed these data in terms of the federal agencies that generate them. Another approach might have started with the search tools. For example:

Federal data search engines and data aggregators
Public data searches, which likely include data on specific topics that may be less generalizable and are more likely to require working with the raw data:
- Kaggle
- Google Dataset Search
Academic hosts, non-profits, and other aggregators of data:
- IPUMS (demographic, economic, and health data — over time)
  - I used to teach a class that required using IPUMS data for the final data analysis project, mostly to simplify the process of finding well-documented and organized data. A former student once said “it’s like online shopping — but for data!” which is basically exactly what it is. (I’m biased — I was and still am affiliated with research centers that are related to the IPUMS data repositories).
- ICPSR (Lots of NIH funded studies’ data is hosted here)

I imagine that some extensions to this work might be:

Getting more specific: identifying common use cases to make even more precise recommendations
Evaluating the analysis tools: it might be helpful to develop and apply an evaluation framework for the analysis tools, to guide users on how to balance a survey’s content against its analysis tool’s usability
Creating new tools: analysts who do work with the original raw data could prioritize extending existing tools with interactive dashboard tools like Shiny or Plotly.
Adding additional data: message me if you have other data sources you think should be included

Wrapping it up

There is lots of data, generally from federally funded surveys, that can serve common business needs in healthcare and health tech. Two use cases, benchmarking your prevalence against an average and identifying the market size for a new product, are particularly well suited to the quick answers that many common data sources can provide. The decision tree and the table/catalog of surveys and their contents will hopefully help you find a useful source more quickly. Most surveys have an online analysis tool that will do the job for basic uses, and it’s worth getting to know one or two of these tools to get this sort of estimate quickly and easily. They aren’t perfect, but they’re definitely better than nothing, and they can spare you from bugging your data team or dusting off old intro stats programming knowledge to get there yourself with the raw data.

Bea Capistrant

Research Lead & VP of Healthcare Innovation

Data Scientist focused on health, health care, and technology that makes the world better.