"Probing Multidomain Architecture Design with Stochastic Sampling and Language Models"
|
Abstract: Multidomain proteins are mosaics of structural or functional modules called domains. The architecture of a multidomain protein, that is, its domain composition in N- to C-terminal order, is intimately related to its function, with each module playing a distinct functional role. The processes of domain insertion, duplication, and deletion enable evolutionary discovery of diverse domain architectures. Nevertheless, only a tiny fraction of the possible domain combinations is observed in nature, suggesting that domain order and co-occurrence are highly constrained. Here, we present new methods for investigating these constraints.
|
We first introduce a stochastic model of domain architecture evolution. This model is implemented in DomArchov, a simulator that uses data-driven transition probabilities to capture the forces acting on domain gain and loss. Second, we adapt methods from information retrieval and natural language processing to model domain architecture composition. In this framework, domain architectures are represented as vectors in a multidimensional space, so that each architecture corresponds to a point. Distances between points quantify the relationship between individual domain architectures, and sets of domain architectures can be compared by superimposing the corresponding sets of points. The pairwise distances extend naturally to set-wise distances, which allow quantitative comparison of sets of domain architectures. Using this framework, we demonstrate that the agreement between genuine and simulated domain architectures exceeds chance expectation, suggesting that DomArchov encodes a realistic model of architectural constraints. The framework is also applicable beyond simulator performance assessment: we are currently investigating its use to compare sets of domain architectures across genomes and across functional classes.
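
To make the vector-space idea concrete, the following is a minimal sketch and not the representation used in this work. It assumes a simple bag-of-domains count vector (which ignores domain order), cosine distance between architectures, and a symmetric mean nearest-neighbour distance between sets. The function names and the example architectures are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def to_vector(architecture):
    """Bag-of-domains count vector for one architecture.

    `architecture` is a list of domain names in N- to C-terminal order.
    Assumption: order is discarded here; an order-aware encoding
    (e.g. domain bigrams) would be needed to capture it.
    """
    return Counter(architecture)


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention for an empty architecture
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - dot / (norm_u * norm_v))


def setwise_distance(set_a, set_b):
    """Symmetric mean nearest-neighbour distance between two sets of vectors.

    Assumption: this is one of several reasonable set-wise distances.
    """
    def directed(xs, ys):
        return sum(min(cosine_distance(x, y) for y in ys) for x in xs) / len(xs)

    return 0.5 * (directed(set_a, set_b) + directed(set_b, set_a))


# Illustrative (hypothetical) architectures, not data from the study.
genuine = [to_vector(a) for a in (["SH3", "SH2", "Kinase"], ["PDZ", "SH3", "GK"])]
simulated = [to_vector(a) for a in (["SH2", "SH3", "Kinase"], ["PDZ", "PDZ", "SH3"])]

if __name__ == "__main__":
    print(setwise_distance(genuine, simulated))
```

Under these assumptions, a small set-wise distance between genuine and simulated sets, relative to the distance obtained for randomized architectures, would correspond to agreement above chance expectation.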