Data Science to Optimize Cardiovascular Disease Prevention (2024-2025)


Cardiovascular diseases are the leading causes of death and long-term disability globally. Tragically, there are pervasive inequities in death and disability from cardiovascular disease. While some factors, such as smoking, diabetes and high cholesterol, are known to cause cardiovascular disease, many other medical conditions and environmental factors could increase the risk but have not been studied well. Furthermore, some medications decrease cardiovascular risk, while others increase it. Access to and use of these medications are not equitable, and these factors likely intersect with other drivers of risk.

It is essential to disentangle these complex relationships to figure out how to prevent and treat cardiovascular disease successfully. Electronic medical records contain detailed information about patient demographics, medication use and chronic medical conditions, and they offer a new source of data that could be leveraged to understand cardiovascular disease better. Using this data in innovative ways will allow clinicians and policymakers to target preventive interventions to the patients at greatest risk for cardiovascular disease.

Project Description

This project team will use large, national electronic health record databases to improve the understanding of risk factors and treatment strategies for cardiovascular disease. For example, the Truveta electronic health record database contains information on more than 100 million patients, including structured information such as medical diagnosis codes, laboratory values and prescription drug information, and free-text medical notes that can be studied using large language models.

Team members will use this data to identify new targets to prevent and predict risk of cardiovascular disease; understand how common medications are linked with cardiovascular disease risk; and take advantage of natural experiments (such as adding medications to clinical practice guidelines) to understand the comparative effectiveness of different management strategies for cardiovascular disease. 

The team will form three subteams that correspond with the above aims:

  1. Preventing and predicting cardiovascular disease: Team members will develop and validate a model that uses electronic health record data to predict major cardiovascular disease. Additionally, the team will perform exploratory studies of risk factors for cardiovascular disease, using hierarchical clustering methods to identify clusters of conditions that convey a higher risk of cardiovascular disease. The team will then test these clusters or individual risk factors using propensity-matched methods with positive and negative control comparison cohorts. This work will lead to the identification of clinically relevant novel risk factors through use of the rich data available in electronic health records.
  2. Pharmacoepidemiology of cardiovascular risk: Team members will identify patterns of concurrent medication use that are associated with increased or decreased cardiovascular risk. While prior studies have evaluated individual medications, many classes of medications have synergistic or antagonistic effects that could give rise to unique risk profiles, especially among patients with baseline cardiovascular disease. Team members will use causal inference strategies to test the most promising medications and medication combinations. This aim will help identify medications that may carry previously unknown cardiovascular risk, or that may be protective against cardiovascular diseases.
  3. Causal inference for cardiovascular disease management: Team members will apply causal inference strategies to cardiovascular care, examining the use of medications to treat heart failure or applying regression discontinuity designs to study the effect of arbitrary cutoffs for cardiovascular therapy eligibility (and other related questions). This work will inform policy and clinical practice by assessing how effective prior policy approaches to cardiovascular disease prediction, management and care have been.
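
As an illustration of the regression discontinuity idea mentioned in the third aim, the following is a minimal sketch on synthetic data, not the team's actual analysis. It assumes a sharp design in which everyone at or above a cutoff on a risk score is treated, and it fits a separate straight line on each side of the cutoff within a fixed bandwidth. The function name, the cutoff of 7.5, the bandwidth and the simulated effect are all hypothetical choices made for the example.

```python
import numpy as np


def sharp_rd_estimate(running, outcome, cutoff, bandwidth):
    """Estimate the jump in the outcome at the cutoff (sharp RD, local linear fit).

    Hypothetical helper for illustration: uses equal weights for all
    observations within `bandwidth` of the cutoff.
    """
    x = np.asarray(running, dtype=float) - cutoff  # center the running variable
    y = np.asarray(outcome, dtype=float)
    keep = np.abs(x) <= bandwidth                  # restrict to the local window
    x, y = x[keep], y[keep]
    treated = (x >= 0).astype(float)               # sharp design: treated at or above cutoff
    # Columns: intercept, jump at cutoff, slope below, change in slope above
    design = np.column_stack([np.ones_like(x), treated, x, treated * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


# Synthetic data (assumption): a risk score with a treatment cutoff at 7.5
rng = np.random.default_rng(0)
n = 5000
score = rng.uniform(0, 20, n)
eligible = score >= 7.5
true_effect = -2.0                                 # simulated effect of eligibility
outcome = 10 + 0.3 * score + true_effect * eligible + rng.normal(0, 1, n)

estimate = sharp_rd_estimate(score, outcome, cutoff=7.5, bandwidth=3.0)
print(f"Estimated jump at cutoff: {estimate:.2f} (simulated effect: {true_effect})")
```

In real electronic health record data, eligibility cutoffs are rarely followed perfectly, so a fuzzy design, a data-driven bandwidth and checks for manipulation of the running variable would likely be needed; none of these are shown here.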

Anticipated Outputs

Peer-reviewed publications; grant applications; poster and oral presentations at conferences

Student Opportunities

Ideally, this team will include 3 graduate students and 6 undergraduates with interests in disciplines such as computer science, economics, public policy, mathematics, statistical science, sociology, biology, public health, data science, medicine and/or allied health.

Team members will participate in a structured series of learning activities that will advance their knowledge and skills in data science, statistics, applied epidemiology and population health science. All team members will have an initial consultation meeting with the team leaders to develop an individualized learning plan and mechanisms for feedback. They will also have access to computational resources within the departments of Neurology, Statistical Sciences, and Population Health Sciences, as well as support and one-on-one mentorship in statistical analysis, manuscript development and oral presentation of results. Students will be able to participate in seminars, research days and journal clubs through the Department of Neurology and will be offered the chance to participate in clinical shadowing as relevant to their interests.

The team will meet biweekly in the Bryan Research Building, with more frequent meetings at the beginning of the semester and as needed to complete tasks.

In Summer 2024, several students will be selected to work over 4-8 weeks to conduct a literature review, perform preliminary descriptive and/or exploratory analyses, and prepare analytical datasets.

See also the related Data+ project for Summer 2024; there is a separate application process for students who are interested in this optional summer component. 


Summer 2024 – Spring 2025

  • Summer 2024 (optional): Conduct literature review; perform preliminary descriptive/exploratory analyses, prepare analytical datasets; complete related Data+ work (separate application)
  • Fall 2024: Devise statistical analysis plans; commence statistical analysis; participate in structured seminar series on research methods and clinical implications
  • Spring 2025: Complete statistical analysis; prepare early draft of manuscripts


Academic credit available for fall and spring semesters; summer funding available

See related Data+ summer project, Data Science to Optimize Cardiovascular Disease Prevention (2024).


Image: Coronary CT angiography of coronary arteries, by Oxford Academic Cardiovascular CT Core Lab and Lab of Inflammation and Cardiometabolic Diseases at NHLBI, NIH Image Gallery, licensed under CC BY-NC 2.0


Team Leaders

  • Fan Li, School of Medicine-Biostatistics and Bioinformatics|Arts & Sciences-Statistical Science
  • Jay Lusk, School of Medicine-Medicine;Neurology;Population Health Sciences
  • Brian Mac Grory, School of Medicine-Neurology;Ophthalmology

Faculty/Staff Team Members

  • Bradley Hammill, School of Medicine-Population Health Sciences
  • Ryan McDevitt, Fuqua School of Business
  • Emily O'Brien, School of Medicine-Duke Clinical Research Institute;Neurology;Population Health Sciences|Margolis Center for Health Policy
  • Nishant Shah, School of Medicine-Medicine: Cardiology