Open to interesting problems

I build pipelines that move clinical and life-science data from messy reality into systems people can trust.

01

Experience

Ten years across clinical research, biostatistics, and data engineering.


2026 — Present

Data Engineer

Sprinter Health · Menlo Park, CA

Building data infrastructure for in-home preventive care.

Python · Spark · AWS
2022 — 2026

Data Engineer II

Verana Health · San Francisco, CA

Authored post-processing components and Airflow orchestration for LLM inference pipelines, integrating Databricks preprocessing with AWS Bedrock batch inference. This was the org's first production-ready LLM inference capability. Replaced a GPU-intensive SageMaker pipeline with a Spark UDF built on spaCy, cutting cost by more than 90%. Migrated EKS pipelines to Databricks, cutting processing time by 50%.

Databricks · AWS Bedrock · Airflow · PySpark · Delta Lake · spaCy
2021 — 2022

Data Scientist / Data Engineer

Thermo Fisher Scientific · South San Francisco, CA

Designed event-driven ETL pipelines from multiple sources to AWS. Built Python and Java services deployed to serverless infrastructure through CI/CD. Performed root-cause analysis with statistical tools and presented results to cross-functional teams.

Python · Java · AWS · ETL · R
2018 — 2021

Biostatistician / Data Scientist

Stanford University — QSU, School of Medicine · Palo Alto, CA

Biostatistics and informatics on the Apple Heart Study (n > 400,000). Built end-to-end demographics + retention ETL on GCP serving an R Shiny dashboard. Trained BERT and LSTM models in PyTorch for clinical-note classification.

BERT · PyTorch · GCP · R Shiny
2018 — 2021

Clinical Research Data Analyst II

Department of Veterans Affairs — PAVIR · Palo Alto, CA

Statistical models and hypothesis testing on the VA Corporate Data Warehouse for manuscript submissions.

R · Statistics
2015 — 2018

Quantitative Research Analyst I

Sutter Health — PAMF · Palo Alto, CA

Healthcare-disparity research across 17 racial/ethnic groups and 50 metrics on the full PAMF adult EMR population, using GLMs in SAS.

SAS · Healthcare · GLMs
02

Selected Projects

Work I've led or contributed to at companies, in research, and on the side.


03

Publications


2020
Rhythm classification from a large remote, prospective cohort

JAMA Cardiology · Apple Heart Study Investigators incl. S. Gummidipundi

Full list on Google Scholar ↗
04

Architecture Diagrams

System designs from production work.


Verana Health — LLM inference pipeline

Production LLM inference: Databricks preprocessing → AWS Bedrock batch inference, orchestrated by Airflow.

  • EXTERNAL · Clinical sources · EHR exports
  • STORE · Raw lake · S3 / Delta
  • SERVICE · Databricks prep · PySpark UDF (spaCy)
  • QUEUE · Bedrock queue · Batch jobs
  • SERVICE · AWS Bedrock · LLM inference
  • SERVICE · Post-processing · Validation + tagging
  • STORE · Curated marts · Delta tables
  • INFRA · Airflow · Orchestration
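
As a rough illustration, the nodes in this diagram can be read as a small dependency graph. The sketch below is not the production code: the stage names are taken from the diagram, but the edges are an assumption (a straight-line flow in the order the nodes are listed), and the identifiers `PIPELINE_EDGES` and `execution_order` are hypothetical. It only shows how an orchestrator such as Airflow would derive a run order from declared dependencies.

```python
from graphlib import TopologicalSorter

# Stage names come from the diagram. The edges are an ASSUMPTION: a linear
# flow in the order the nodes are listed, since the page text does not
# spell out the connections. Each key maps to the stages it depends on.
PIPELINE_EDGES = {
    "raw_lake": {"clinical_sources"},
    "databricks_prep": {"raw_lake"},
    "bedrock_queue": {"databricks_prep"},
    "aws_bedrock": {"bedrock_queue"},
    "post_processing": {"aws_bedrock"},
    "curated_marts": {"post_processing"},
}


def execution_order(edges):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(edges).static_order())


if __name__ == "__main__":
    # Print the stages in the order an orchestrator would schedule them.
    for step, stage in enumerate(execution_order(PIPELINE_EDGES), start=1):
        print(f"{step}. {stage}")
```

Declaring dependencies rather than hard-coding a sequence means a new stage can be added by editing the mapping alone, which is the same idea an Airflow DAG expresses with task dependencies.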
05

Skills


Programming

Python · R · SQL · Shell

Data / Frameworks

Spark · Airflow · Delta Lake · Terraform · Kubernetes · Docker

Cloud

Databricks · AWS Bedrock · AWS SageMaker · GCP BigQuery · GCP Cloud Run · Azure

ML / NLP

PyTorch · spaCy · BERT · LSTMs
06

Now

A few things on my plate.


  • Ramping at Sprinter Health.
  • Building symbol-screen on the side.
  • Advising a marketing law firm on data infrastructure.
  • Re-reading Designing Data-Intensive Applications.
07

Contact


Open to interesting problems and conversations. Reach out via email or find me on GitHub and LinkedIn.

santosh@santoshg.io · GitHub · LinkedIn

© 2026 Santosh Gummidipundi · santosh@santoshg.io