Peptides, short chains of amino acids, are fundamental building blocks in biology, combining structural simplicity with remarkable versatility. Their sequences dictate their properties and behaviors, enabling them to self-assemble into diverse structures such as fibers, hydrogels, and nanotubes. These assemblies play crucial roles in biological systems and have applications in drug delivery, tissue engineering, and catalysis. However, the factors driving peptide self-assembly, spanning sequence, concentration, pH, and solvent, remain poorly understood due to the fragmented nature of existing data.

What governs the self-assembly of peptides into specific structures? Our work addresses this question by integrating literature mining, machine learning, and systematic analysis to uncover the deeper principles behind peptide self-assembly. We curated a dataset of over 1,000 experimental entries from the academic literature, capturing peptide sequences, experimental conditions, and resulting assembly phases. This let us connect the dots between a large set of distinct studies and elucidate principles that cut across individual experimental results, forming a unifying model. We achieved this by:

➡️ Augmented Literature Mining: We fine-tuned an LLM on manually curated data to extract detailed experimental parameters with significantly improved accuracy. This approach accelerated what is typically a time-consuming process while maintaining high fidelity to complex, nuanced data.

➡️ Machine Learning for Phase Prediction: We developed ML models to predict self-assembly phases from peptide sequences and experimental stimuli. Our model demonstrated over 80% accuracy, providing actionable insights into how variables like peptide concentration and solvent conditions influence assembly outcomes.

➡️ Iterative Improvement: We designed a workflow in which newly extracted data can augment the dataset, continuously refining both the ML models and our understanding of peptide self-assembly mechanisms.

Broader Implications
While peptides are made up of just a few amino acids, their ability to self-assemble into highly ordered and functional structures, such as hydrogels, nanotubes, and even crystals, is astonishing. This phenomenon is driven not by strong chemical bonds but by weak, non-covalent interactions such as hydrogen bonding, van der Waals forces, and hydrophobic effects. By combining human expertise with machine intelligence, we not only accelerate discovery but also promote deeper reflection on the governing rules of complex systems. We believe that many other fields could benefit from such a systematic integration of human insight and machine learning.

Code: PeptideMiner, https://lnkd.in/eWBSTDjK
Paper: Zhenze Yang, Sarah Yorke, Tuomas Knowles, Markus J. Buehler, Learning the rules of peptide self-assembly through data mining with large language models, https://lnkd.in/ekFQ3giK, 2024
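As a rough illustration of the phase-prediction idea (not the paper's actual pipeline; see the PeptideMiner repository for that), here is a minimal sketch in which a peptide sequence plus experimental conditions are featurized and passed to an off-the-shelf classifier. The amino-acid-composition features, the toy entries, and the random-forest choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy featurization: amino-acid composition of the sequence plus experimental
# conditions (concentration, pH). Features, entries, and model choice are
# illustrative, not the pipeline from the paper.
AA = "ACDEFGHIKLMNPQRSTVWY"

def featurize(sequence, conc_mm, ph):
    composition = [sequence.count(a) / len(sequence) for a in AA]
    return composition + [conc_mm, ph]

entries = [  # (sequence, concentration in mM, pH, observed assembly phase) -- made-up examples
    ("FF", 10.0, 7.0, "nanotube"),
    ("RGDF", 5.0, 7.4, "solution"),
    ("FKFEFKFE", 8.0, 7.0, "hydrogel"),
    ("ILVAGK", 12.0, 6.5, "fiber"),
]
X = np.array([featurize(s, c, p) for s, c, p, _ in entries])
y = np.array([phase for *_, phase in entries])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)  # with a real curated dataset, evaluate with cross-validation instead
print(clf.predict([featurize("FF", 10.0, 7.0)]))
```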
Biological Systems Modeling
-
Molecular dynamics simulation of an entire cell
By Jan A. Stevens, Fabian Grünewald, P. A. Marco van Tilburg, Melanie König, Benjamin R. Gilbert, Troy A. Brier, Zane R. Thornburg, Zaida Luthey-Schulten, Siewert J. Marrink

Abstract: The ultimate microscope, directed at a cell, would reveal the dynamics of all the cell’s components with atomic resolution. In contrast to their real-world counterparts, computational microscopes are currently on the brink of meeting this challenge. In this perspective, we show how an integrative approach can be employed to model an entire cell, the minimal cell JCVI-syn3A, at full complexity. This step opens the way to interrogate the cell’s spatio-temporal evolution with molecular dynamics simulations, an approach that can be extended to other cell types in the near future.

https://lnkd.in/gtnntTK6
doi: 10.3389/fchem.2023.1106495

The illustration is a snapshot of the complete Figure 2, "Whole-cell Martini model of JCVI-syn3A", from the paper and is made available under a CC BY 4.0 International license.
-
Most biodiversity footprint assessment tools & databases share a common weakness: the EXIOBASE model. So it is excellent news that after 4 years, it has received a major update!

📖 EXIOBASE is an Environmentally-Extended Multi-Regional Input-Output model (EEMRIO). Put more simply, it's a set of tables describing the economy, including trade between countries, together with a lot of data on the associated environmental impacts.

🔦 Used by most tools & databases
Whether they mention it or not, most of the tools used by corporates and financial institutions to assess their impacts on biodiversity and the associated transition risks rely on EXIOBASE when only financial data are available. Let me stress the end of that sentence, as it's quite important: when commodities, pressures (e.g. water consumption, land occupation, etc.) or biodiversity state data are collected, many tools use them directly and do not rely on EXIOBASE's proxies based on financial data. But when there are gaps, EXIOBASE fills them. Users of EXIOBASE include ENCORE v2 (which uses it to infer upstream dependencies or impacts), the #GlobalBiodiversityScore (GBS) and the Biodiversity Footprint for Financial Institutions (BFFI). Together with the pressure-impact models (GLOBIO, ReCiPe, LC Impact...), EXIOBASE is one of the core and common components of all those tools. Its limitations are thus their limitations (again, when only financial data are available).

🔎 Why is EXIOBASE so prevalent?
It was built to assess the environmental impacts of economic activities as precisely as possible. As a consequence, its sectoral granularity is almost unmatched (except perhaps by Eora), especially for high-impact sectors such as agriculture (sub-divided by crop) or electricity production (sub-divided by fuel or source, e.g. solar or wind). It also has great regional granularity, with 49 regions. And its environmental data are rich, covering GHG emissions, commodity consumption, etc. It is also free, even though its licence does not allow all uses; commercial licences are possible.

🆕 What's in EXIOBASE 3.9.5?
Released on 14 February 2025, it contains major improvements:
1️⃣ Timeliness: all data updated to 2020 (vs 2011 previously), with annual updates planned (and nowcasting already up to 2024 in beta). This is the most significant change. Using 2011 data was really starting to make no sense, as the economy has changed deeply since then. All the tools & databases should quickly update to EXIOBASE 3.9.5.
2️⃣ More reliable GHG emission factors, directly available as Scope 1, 2 and 3 figures consistent with the GHG Protocol.
3️⃣ Other improvements that are hard to explain to non-specialists, but: it's great!

💬 Do you know which data are used in your biodiversity footprint assessments? Are they up to EXIOBASE 3.9.5's quality?
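For readers unfamiliar with EEMRIO mechanics, here is a minimal, self-contained sketch of the environmentally-extended input-output calculation that models like EXIOBASE implement at much larger scale: total output is derived from final demand via the Leontief inverse, and environmental intensities are applied to that output. All numbers below are made up for illustration and are not EXIOBASE data.

```python
import numpy as np

# Toy environmentally-extended input-output (EEIO) calculation for 2 sectors.
# A: technical coefficient matrix (inter-sector purchases per unit of output)
# y: final demand vector, f: environmental intensity per unit of output
# (all numbers are illustrative, not EXIOBASE data)
A = np.array([[0.10, 0.30],   # sector 1 inputs per unit of output of sectors 1, 2
              [0.20, 0.05]])  # sector 2 inputs per unit of output of sectors 1, 2
y = np.array([100.0, 50.0])   # final demand (e.g., M EUR)
f = np.array([0.8, 0.2])      # e.g., ktCO2e per M EUR of output

# Leontief inverse gives total (direct + indirect) output needed per unit of demand
L = np.linalg.inv(np.eye(2) - A)
x = L @ y                     # total output by sector
footprint = f @ x             # total environmental pressure attributable to y
print(x, footprint)
```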
-
Can we use #LCA to measure a product system's impact on #biodiversity ❓ The answer is yes❗

How reliable are these calculations? Well, that is up for discussion. The impact on biodiversity should ideally be measured in situ, by surveying the species richness of an ecosystem, in combination with other techniques that usually include local communities' knowledge. Why do I think so? Because ecosystems are essentially unique everywhere we look, so the impact of a substance emission or a material extraction from nature (elementary flows) varies from region to region. Performing a given activity in an urban area is different from performing it in a rainforest.

However, in the last decade, new Life Cycle Impact Assessment (LCIA) methods have been developed to account for regional differences in impacts on biodiversity. They typically focus on assessing the impacts of #landuse and land-use change, as these are among the most significant drivers of biodiversity loss. They may quantify impacts in terms of the potentially disappeared fraction of species (PDF) over a certain area and time (usually PDF·m²·yr) or use other metrics to estimate the change in species richness or ecosystem quality. Some of the methods that include approaches to assess biodiversity impacts are:

➖ ReCiPe: a comprehensive LCIA method that includes a model for assessing land-use impacts on biodiversity through the PDF metric. It aims to quantify species loss over a certain area and time due to land use.
➖ IMPACT World+ (endpoint): this method attempts to integrate biodiversity impacts through several impact categories, such as PDF from freshwater acidification, damage to ecosystem quality from changes in soil pH, marine acidification, ecotoxicity, land transformation and occupation, water pollution, and water availability. It is one of the most complete.
➖ USEtox: focused on toxicological impacts, it includes considerations for ecotoxicity, which indirectly affects biodiversity by assessing the potential toxic impacts on aquatic and terrestrial species.
➖ Land-use biodiversity (Chaudhary et al., 2015), recommended by the UNEP-SETAC Life Cycle Initiative: "The indicator represents regional species loss taking into account the effect of land occupation displacing entirely or reducing the species that would otherwise exist on that land, the relative abundance of those species within the ecoregion, and the overall global threat level for the affected species." I love this method because it includes regional factors.
➖ Global Biodiversity Score (GBS): not a traditional LCIA method, GBS is a tool developed to help companies assess their impact on biodiversity. Using a common metric, it translates pressures from organizational activities into impacts on biodiversity.

We need to think way beyond #carbonfootprint to aim for a #sustainable world. Biodiversity loss, although highly interlinked with #climatechange, is the actual major environmental issue we face.
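To make the PDF metric concrete, here is a minimal sketch of a land-occupation impact calculation of the kind these LCIA methods perform. The characterization factors, land-use types, and ecoregion names are illustrative placeholders, not values from ReCiPe or Chaudhary et al.

```python
# Toy land-occupation impact in PDF·m²·yr:
# impact = PDF (fraction of species lost under that land use in that ecoregion)
#          × occupied area [m²] × occupation time [yr]
# The factors below are illustrative placeholders, not published values.
characterization_factors = {
    ("annual crops", "tropical moist forest"): 0.45,
    ("pasture", "tropical moist forest"): 0.25,
}

def land_occupation_impact(area_m2, years, land_use, ecoregion):
    pdf = characterization_factors[(land_use, ecoregion)]
    return pdf * area_m2 * years  # PDF·m²·yr

print(land_occupation_impact(10_000, 5, "annual crops", "tropical moist forest"))
```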
-
Can large language models be used in biotech? The short answer is yes. While LLMs are often associated with chatbots, their capabilities extend well beyond that.

In biotech, much of the data comes in the form of sequences, like nucleotides in DNA or amino acids in proteins. Similar to sentences in natural language, these biological sequences carry semantic meaning that depends on the arrangement of their components. When input data is fed into an LLM, a transformer converts these sequences into contextual vectors using its attention mechanism. This process allows the model to capture the context and relationships within the data, enabling it to predict subsequent elements.

One such use case is the prediction of neoantigens, which make it possible to target tumor cells in personalized cancer immunotherapies. Neoantigens are tumor-specific mutated peptides presented on the surface of tumor cells because they bind to human leukocyte antigen (HLA) molecules. LLMs can predict this binding affinity, which allows the development of personalized therapies that use the patient's own immune system to kill tumor cells without damaging healthy tissue.
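As a toy illustration of how such a binding-affinity predictor can be framed (the architecture, sequences, and hyperparameters below are illustrative assumptions, not any published neoantigen model), one can encode a peptide and an HLA pseudo-sequence, run them through a small transformer encoder, and predict a binding score:

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
aa_to_idx = {a: i + 1 for i, a in enumerate(AA)}  # 0 reserved for padding

def encode(seq, max_len=60):
    """Map an amino-acid string to a fixed-length index tensor."""
    idx = [aa_to_idx[a] for a in seq[:max_len]]
    return torch.tensor(idx + [0] * (max_len - len(idx)))

class BindingPredictor(nn.Module):
    """Tiny transformer encoder over a concatenated peptide + HLA sequence."""
    def __init__(self, d_model=64, nhead=4, nlayers=2, max_len=120):
        super().__init__()
        self.embed = nn.Embedding(21, d_model, padding_idx=0)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, 1)  # binding score (logit)

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        h = self.encoder(h)
        return self.head(h.mean(dim=1)).squeeze(-1)

# Example peptide and an illustrative (not real) HLA pseudo-sequence
peptide = "SIINFEKL"
hla_pseudo = "YFAMYQENMAHTDANTLYIIYRDYTWVARVYRGY"
x = torch.cat([encode(peptide), encode(hla_pseudo)]).unsqueeze(0)

model = BindingPredictor()
print(torch.sigmoid(model(x)))  # untrained score; train on measured binding affinities
```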
-
Physics-Informed Neural ODEs with Scale-Aware Residuals for Learning Stiff Biophysical Dynamics
Kamalpreet Singh Kainth, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedat Panat
https://lnkd.in/dtRD_ide

This paper tackles a critical challenge in using Neural ODEs for modeling stiff biophysical systems, specifically the difficulty in achieving stable and accurate long-term predictions. The core idea is to improve training stability and accuracy by incorporating physics-informed regularization with scale-aware residual normalization within a Neural ODE framework. The authors propose "PI-NODE-SR," which combines two key elements:

1. Physics-Informed Regularization: they leverage the known governing equations (e.g., Hodgkin-Huxley) to define a physics-informed loss term that penalizes deviations from the expected behavior. This is a standard PINN approach, but the key is how they handle the residual.

2. Scale-Aware Residual Normalization: this is the novel contribution. Stiff systems often involve state variables evolving on vastly different timescales. Directly applying a physics-informed loss can lead to one timescale dominating the training process, hindering convergence and accuracy. To address this, they normalize the residual associated with each state variable by a scaling factor. This factor is chosen to balance the contributions of each variable to the overall loss, preventing the faster dynamics from overwhelming the slower ones. They use a low-order explicit solver (Heun method) to calculate the residual.

The authors demonstrate the effectiveness of PI-NODE-SR on the Hodgkin-Huxley equations, showing that it can learn from a single oscillation and extrapolate accurately over longer time horizons. Importantly, they show that the method can recover morphological features in the gating variables that are typically only captured by higher-order solvers.

-----

This work directly addresses a significant limitation of Neural ODEs in scientific applications: their struggle with stiff systems. By introducing scale-aware residual normalization, the authors provide a principled way to stabilize training and improve the accuracy of long-term predictions. This is particularly relevant for:
- Biophysical modeling: accurately simulating neuronal dynamics, cardiac electrophysiology, and other complex biological processes.
- Chemical kinetics: modeling reaction networks with widely varying reaction rates.
- Multi-physics simulations: situations where different physical processes evolve on different timescales.

The ability to use lower-order solvers with neural correction to achieve accuracy comparable to higher-order methods has significant implications for computational efficiency. While the method's sensitivity to initialization is noted, the overall approach offers a promising direction for developing more robust and efficient Physics-Informed Neural ODEs for a wide range of scientific applications.
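A minimal sketch of the ingredients described above (a neural right-hand side, a Heun rollout, and a physics-informed residual normalized per state variable) might look like the following. The toy two-state stiff system, the scale factors, and the loss weights are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class NeuralRHS(nn.Module):
    """Neural network approximating dx/dt for a 2-state toy stiff system."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))
    def forward(self, x):
        return self.net(x)

def heun_step(f, x, dt):
    """Low-order explicit (Heun) integration step used to roll out the model."""
    k1 = f(x)
    k2 = f(x + dt * k1)
    return x + 0.5 * dt * (k1 + k2)

def physics_rhs(x):
    """Known governing equations (a toy stiff linear system standing in for
    e.g. Hodgkin-Huxley)."""
    fast, slow = x[..., 0], x[..., 1]
    return torch.stack([-100.0 * fast, -0.1 * slow], dim=-1)

def physics_informed_loss(f, x_batch, scales):
    """Physics residual, normalized per state variable so the fast dynamics
    do not dominate the slow ones in the loss."""
    residual = f(x_batch) - physics_rhs(x_batch)  # shape (batch, dim)
    return ((residual / scales) ** 2).mean()

f = NeuralRHS()
x = torch.randn(32, 2)
scales = torch.tensor([100.0, 0.1])               # per-variable scale factors (illustrative)
x_next_data = heun_step(physics_rhs, x, dt=1e-3)  # stand-in for observed trajectory data

data_loss = ((heun_step(f, x, 1e-3) - x_next_data) ** 2).mean()
loss = data_loss + 1.0 * physics_informed_loss(f, x, scales)
loss.backward()
```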
-
Our DINO (dynamics-informed dataset to overcome the limitations of static molecular data in AI-driven drug discovery) proposal is public! Static molecular structures are useful, but they miss the dynamics that underlie real molecular function. Most AI models to date are trained on static data, causing them to suffer from state bias, miss key motions, and generate biologically implausible candidates. As a result, generated candidates often lack the biological plausibility needed for molecules to be a drug and require costly experimental validation. Can this be improved by embedding molecular dynamics directly into AI models? Victor Greiff, Rahmad Akbar, and I propose DINO, a dynamics-informed dataset designed to bridge static structural data and real molecular behavior in AI-driven drug discovery. By integrating experimental and synthetic molecular dynamics across proteins, antibodies, nucleic acids, small molecules, and complexes, DINO aims to "embed molecular motion into AI-driven design" and capture "biophysically realistic dynamics and functional behavior". Data obtained in this way can enable the learning of conformational ensembles, binding energetics, and functional kinetics in the next generation of AI models. By grounding molecular design in thermodynamic principles, DINO can enable AI systems to "move beyond static assumptions and generate biochemically plausible candidates with higher therapeutic potential", supporting mechanistic, uncertainty-aware modeling and more biologically realistic therapeutic design in silico. Read the proposal here: https://lnkd.in/e2tdz3gJ
-
🧵 1/ In high-dimensional bio data (transcriptomics, proteomics, metabolomics) you're almost guaranteed to find something "significant." Even when there's nothing there.

2/ Why? Because when you test 20,000 genes against a phenotype, some will look like they're associated. Purely by chance. It's math, not meaning.

3/ Here's the danger: you can build a compelling story out of noise. And no one will stop you, until it fails to replicate.

4/ As one paper put it: "Even if response and covariates are scientifically independent, some will appear correlated—just by chance." That's the trap. https://lnkd.in/ecNzUpJr

5/ High-dimensional data is a storyteller's dream. And a statistician's nightmare. So how do we guard against false discoveries? Let's break it down.

6/ Problem: spurious correlations. Cause: thousands of features, not enough samples. Fix: multiple testing correction (FDR, Bonferroni). Don't just take p < 0.05 at face value. Read my blog on understanding multiple testing correction: https://lnkd.in/ex3S3V5g

7/ Problem: overfitting. Cause: the model learns noise, not signal. Fix: regularization (LASSO, Ridge, Elastic Net). Penalize complexity. Force the model to be selective. Read my blog post on regularization for scRNAseq marker selection: https://lnkd.in/ekmM2Pvm

8/ Problem: poor generalization. Cause: the model only works on your dataset. Fix: cross-validation (k-fold, bootstrapping). Train on part of the data, test on the rest. Always.

9/ Want to take it a step further? Replicate in an independent dataset. If it doesn't hold up in new data, it was probably noise.

10/ Another trick? Feature selection. Reduce dimensionality before modeling. Fewer variables = fewer false leads.

11/ Final strategy? Keep your models simple. Complexity fits noise. Simplicity generalizes.

12/ Here's your cheat sheet:
Problem: spurious signals. Fixes: FDR, Bonferroni, feature selection.
Problem: overfitting. Fixes: LASSO, Ridge, cross-validation.
Problem: poor generalization. Fixes: replication, simpler models.

13/ Remember: the more dimensions you have, the easier it is to find a pattern that's not real. A result doesn't become truth just because it passes p < 0.05.

14/ Key takeaways: high-dimensional data creates false signals. Multiple testing corrections aren't optional. Simpler is safer. Always validate. Replication is king.

15/ The story you tell with your data? Make sure it's grounded in reality, not randomness. Because the most dangerous lie in science... is the one told by your own data.

I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter chatomics to learn bioinformatics: https://lnkd.in/erw83Svn
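A compact way to see points 6-8 in action is to simulate pure noise and watch raw p-values produce "hits" that FDR correction and a cross-validated, regularized model do not. The sketch below uses synthetic data and standard scipy/statsmodels/scikit-learn calls; the sizes and thresholds are arbitrary.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 200 samples, 5,000 random "genes", and a random binary phenotype: pure noise.
X = rng.normal(size=(200, 5000))
y = rng.integers(0, 2, size=200)

# Point 6: raw univariate p-values yield plenty of "significant" genes by chance;
# Benjamini-Hochberg FDR correction removes essentially all of them.
pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue for j in range(X.shape[1])])
print("raw hits at p < 0.05:", int((pvals < 0.05).sum()))
rejected, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("hits after FDR:", int(rejected.sum()))

# Points 7-8: an L1-regularized model evaluated out-of-fold stays near chance
# accuracy on noise, instead of reporting an inflated in-sample fit.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```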
-
Synthetic biology is, quite literally, our future. A groundbreaking new biological foundation model, Evo 2, achieves state-of-the-art prediction of genetic variation impacts and generates coherent genome sequences spanning all domains of life. A diverse team from leading research institutions, including the Arc Institute, Stanford University, NVIDIA, and the University of California, Berkeley, trained the model on 9.3 trillion DNA base pairs and has fully shared all code, parameters, and data. A few highlights from the paper (link in comments):

🔬 Zero-shot prediction achieves state-of-the-art accuracy in genetic variant interpretation. Evo 2 can predict the functional consequences of genetic mutations across all domains of life without specialized training. It surpasses existing models in assessing the pathogenicity of both coding and noncoding variants, including BRCA1 cancer-linked mutations. This generalist capability suggests Evo 2 could revolutionize genetic disease research, reducing reliance on expensive, manually curated datasets.

🛠 Genome-scale generation paves the way for synthetic life design. Evo 2 can generate full-length genome sequences with realistic structure and function, including mitochondrial genomes, bacterial chromosomes, and yeast DNA. Unlike prior models, Evo 2 maintains natural sequence coherence, improving synthetic biology applications like engineered microbes or artificial organelles. This sets the stage for programmable biology at an unprecedented scale.

🧬 Unprecedented long-context understanding revolutionizes genomic analysis. Evo 2 operates with a context window of up to 1 million nucleotides, far beyond the capabilities of previous models, allowing it to analyze genomic features across vast distances. This ability enables it to accurately identify regulatory elements, exon-intron boundaries, and structural components critical for understanding genome function. Its long-context recall is a major breakthrough for interpreting complex biological sequences.

🎛 Inference-time search enables controllable epigenomic design. Evo 2's generative abilities extend beyond raw DNA sequence to epigenomic features, allowing researchers to design sequences with specific chromatin accessibility patterns. This approach successfully encoded Morse code messages into synthetic epigenomes, demonstrating a new method for controlling gene regulation via AI. This could lead to breakthroughs in gene therapy and epigenetic engineering.

🔮 Future potential: toward AI-driven biological design and virtual cell modeling. Evo 2 represents a major leap toward AI-powered genomic engineering. Future iterations could integrate additional biological layers, such as transcriptomics and proteomics, to create virtual cell models that simulate complex cellular behaviors. This could revolutionize drug discovery, genetic therapy, and even synthetic life creation.
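For context on how zero-shot variant scoring with a genomic language model typically works (score the reference and the mutated sequence, then compare log-likelihoods), here is a generic sketch. The checkpoint name and sequences are placeholders, and this shows the general recipe rather than the actual Evo 2 code or API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Generic zero-shot variant scoring with an autoregressive DNA language model.
# "example-org/dna-causal-lm" is a placeholder, not the actual Evo 2 checkpoint.
name = "example-org/dna-causal-lm"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def total_log_likelihood(seq: str) -> float:
    ids = tok(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)        # mean cross-entropy over tokens
    return -out.loss.item() * ids.shape[1]  # approximate total log-likelihood

ref = "ACGTTGCATTGACGGATTACAGGT"            # reference window around the variant (toy)
alt = ref[:10] + "A" + ref[11:]             # same window carrying a single substitution
score = total_log_likelihood(alt) - total_log_likelihood(ref)
print(score)  # strongly negative scores suggest the variant is likely deleterious
```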
-
Quietly, something important just happened: the Ecosystem Integrity Index (EII) is now open source. Every team can now reliably set targets and report against a comparable, trustworthy measure of ecosystem health to their stakeholders.

That might sound technical. It's not. It's a fundamental shift in ecosystem visibility. For years, one of the biggest constraints on nature-based finance hasn't been intent or capital; it's been measurement. We've had strong data on pressures (land use, climate, extraction). We've had deep insight into species responses. But we've lacked a unified, scalable, publicly available way to assess ecosystem condition itself. EII begins to change that.

This release includes:
– Global coverage at 300 m resolution
– Current-state ecosystem mapping
– A Python API for accessible data use
– Google Earth Engine integration for scalable analysis

Why this matters:
– You can't price risk you can't see.
– You can't invest at scale without comparability.
– And you can't govern what you can't measure.

Open, system-level metrics like this don't just support better science; they unlock better markets, better decisions, and better incentive structures. Check it out: https://lnkd.in/gSPQcRwg
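As a sketch of what the Google Earth Engine integration could look like in practice, here is a minimal Python example that averages a raster over an area of interest. The asset ID and bounding box are hypothetical placeholders; consult the EII release (link above) for the real asset path and band names.

```python
import ee

# Minimal sketch: pull a mean ecosystem-condition value for an area of interest
# from Google Earth Engine. The asset ID below is a hypothetical placeholder,
# not the published EII asset path.
ee.Initialize()

eii = ee.Image("projects/example/assets/ecosystem_integrity_index_v1")  # hypothetical asset ID
aoi = ee.Geometry.Rectangle([5.5, 51.8, 6.0, 52.2])                     # example bounding box

stats = eii.reduceRegion(
    reducer=ee.Reducer.mean(),
    geometry=aoi,
    scale=300,        # the release is described as 300 m resolution
    maxPixels=1e9,
)
print(stats.getInfo())
```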