Evaluating AI-Generated Biological Synthesis Against Experimentally Validated Public Datasets
AI systems are increasingly used to synthesize biological literature, but useful synthesis should be judged by whether its claims can be operationalized and checked against experimental data. We evaluated a claim-level protocol for testing AI-generated biological synthesis against public cancer datasets. The system generated twelve structured claims spanning two validation domains: CRISPR gene dependency in DepMap 24Q2 (seven claims) and pharmacogenomic drug sensitivity in GDSC release 8.5 (five claims). Each claim was mapped to an explicit biomarker or lineage cohort, an experimental response variable, and a direction-aware concordance statistic. Six of seven dependency claims were supported after multiple-testing correction, including WRN dependency in MSI-high models, PAX8 dependency in ovary/fallopian-tube models, SOX10/MITF dependency in skin melanoma models, and BRAF/KRAS oncogene dependency in hotspot-mutant contexts. Drug-response synthesis was less reliable: BRAF mutation strongly predicted RAF inhibitor sensitivity and KRAS mutation partially predicted MEK inhibitor sensitivity, but ERBB2 mutation alone did not recover ERBB2 inhibitor sensitivity and BRCA1/2 hotspot mutation alone did not recover PARP inhibitor sensitivity. One EGFR claim was not directly verifiable because the positive cohort was too small. These results show that AI-generated synthesis can recover strong, established preclinical relationships, but that plausible biomedical language often hides biomarker mismatch, sparse cohorts, and missing biological context. The appropriate role for AI-generated synthesis is therefore exploratory claim generation followed by explicit public-data validation.
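The abstract does not define its direction-aware concordance statistic or name its multiple-testing procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a pairwise (Mann-Whitney-style) concordance: the fraction of positive/negative model pairs ordered the way the claim predicts. It also assumes Benjamini-Hochberg correction across claims. The `Claim` fields and both function names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Claim:
    """One structured claim. All field names are illustrative assumptions."""
    name: str
    cohort: str      # biomarker or lineage defining the positive cohort
    response: str    # experimental response variable, e.g. a gene effect score
    direction: str   # "lower" if positives should score lower, else "higher"


def concordance(pos: Sequence[float], neg: Sequence[float], direction: str) -> float:
    """Fraction of (positive, negative) pairs ordered as the claim predicts.

    Ties count as half a pair. 0.5 means no separation between cohorts;
    1.0 means every pair agrees with the claimed direction.
    """
    if not pos or not neg:
        raise ValueError("both cohorts must be non-empty")
    sign = -1.0 if direction == "lower" else 1.0
    score = 0.0
    for p in pos:
        for n in neg:
            diff = sign * (p - n)
            if diff > 0:
                score += 1.0
            elif diff == 0:
                score += 0.5
    return score / (len(pos) * len(neg))


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For a dependency claim, `direction="lower"` matches the DepMap convention that a more negative gene effect indicates stronger dependency. For a drug-sensitivity claim, it matches the convention that a lower IC50 or AUC indicates greater sensitivity. The sketch deliberately leaves out the significance test that would produce the p-values, because the source does not say which test was used.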
Reviews
This paper proposes and pilots a claim-level evaluation protocol for AI-generated biological “synthesis.” Usefulness is judged by whether generated statements can be operationalized into testable hypotheses and checked against large public experimental datasets (DepMap 24Q2 CRISPR dependencies; GDSC 8.5 drug response). The central strength is the framing: requiring structured claims (biomarker/cohort, response variable, directionality, and a concordance statistic) is a concrete, falsifiable alternative to subjective “does the summary sound right?” evaluation. The reported pattern of results, in which most dependency claims are supported after multiple-testing correction while drug-response claims are more brittle because of biomarker mismatch, sparse cohorts, and missing context, is also consistent with known properties of these datasets and with the way biomedical language can under-specify causal context.
The main weakness is that, judging from the provided excerpt, key methodological details needed to assess rigor and reproducibility are missing or underspecified: how claims were generated (prompting, model version, temperature, sampling, filtering); whether the 12 claims were pre-registered or cherry-picked; the exact cohort and biomarker definitions (e.g., MSI-high calling, the “hotspot” definition, lineage mapping); the response variables and statistics used (AUC or IC50? gene effect? what exactly is “direction-aware concordance”?); and the multiple-testing procedure and the family of hypotheses being corrected. With only 12 claims and no references supporting the dataset-handling choices, the quantitative conclusion should be phrased more cautiously, as a pilot demonstration rather than as evidence of general AI synthesis reliability. The qualitative conclusion (“use for exploratory claim generation followed by explicit public-data validation”) is nonetheless justified and appropriately conservative.