Santosh Gummidipundi

Open to interesting problems

I build pipelines that move clinical and life-science data from messy reality into systems people can trust.


Experience

Ten years across clinical research, biostatistics, and data engineering.

2026 — Present

Data Engineer

Sprinter Health · Menlo Park, CA

Building data infrastructure for in-home preventive care.

Python · Spark · AWS

2022 — 2026

Data Engineer II

Verana Health · San Francisco, CA

Authored post-processing components and Airflow orchestration for LLM inference pipelines, integrating Databricks preprocessing with AWS Bedrock batch inference; this was the org's first production-ready LLM inference capability. Ported a GPU-intensive SageMaker pipeline to a Spark UDF built on spaCy, reducing cost by more than 90%. Migrated EKS pipelines to Databricks, cutting processing time by 50%.

Databricks · AWS Bedrock · Airflow · PySpark · Delta Lake · spaCy

2021 — 2022

Data Scientist / Data Engineer

Thermo Fisher Scientific · South San Francisco, CA

Designed event-driven ETL pipelines that moved data from multiple sources into AWS. Built Python and Java services deployed to serverless infrastructure through CI/CD. Performed root-cause analysis with statistical tools and presented the results to cross-functional teams.

Python · Java · AWS · ETL · R

2018 — 2021

Biostatistician / Data Scientist

Stanford University — QSU, School of Medicine · Palo Alto, CA

Biostatistics and informatics on the Apple Heart Study (n > 400,000). Built an end-to-end demographics and retention ETL on GCP serving an R Shiny dashboard. Trained BERT and LSTM models in PyTorch for clinical-note classification.

BERT · PyTorch · GCP · R Shiny

2018 — 2021

Clinical Research Data Analyst II

Department of Veterans Affairs — PAVIR · Palo Alto, CA

Built statistical models and ran hypothesis tests on the VA Corporate Data Warehouse for manuscript submissions.

R · Statistics

2015 — 2018

Quantitative Research Analyst I

Sutter Health — PAMF · Palo Alto, CA

Healthcare-disparity research across 17 racial/ethnic groups and 50 metrics on the full PAMF adult EMR population, using GLMs in SAS.

SAS · Healthcare · GLMs

Selected Projects

Work I've led or contributed to at companies, in research, and on the side.

Publications

2020

Rhythm classification from a large remote, prospective cohort

JAMA Cardiology · Apple Heart Study Investigators incl. S. Gummidipundi

2019

Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation

NEJM · Perez M.V. et al.

Full list on Google Scholar ↗

Architecture Diagrams

System designs from production work.

Verana Health — LLM inference pipeline

Production LLM inference: Databricks preprocessing → AWS Bedrock batch inference, orchestrated by Airflow.

Skills

Programming

Python · R · SQL · Shell

Data / Frameworks

Spark · Airflow · Delta Lake · Terraform · Kubernetes · Docker

Cloud

Databricks · AWS Bedrock · AWS SageMaker · GCP BigQuery · GCP Cloud Run · Azure

ML / NLP

PyTorch · spaCy · BERT · LSTMs

Now

A few things on my plate.

● Ramping at Sprinter Health.
● Building symbol-screen on the side.
● Advising a marketing law firm on data infrastructure.
● Re-reading Designing Data-Intensive Applications.

Contact

Open to interesting problems and conversations. Reach out via email or find me on GitHub and LinkedIn.

santosh@santoshg.io · GitHub · LinkedIn