An A.I. company has cracked a 50-year-old medical puzzle

Jeremy Kahn

2021-07-20

DeepMind was able to make highly accurate predictions for most protein types.



Translator (Chinese edition): Agatha


DeepMind, the London-based artificial intelligence company, has published further details of how it solved a 50-year-old scientific challenge late last year, using A.I. software to predict the shape into which proteins would fold based solely on their genetic code.

The shape of a protein is important because it helps determine that protein’s function. Most drugs work by binding to very specifically shaped “pockets” within the structure of a protein. So knowing the exact shape of the protein can be a critical step in the development of new pharmaceuticals, and DeepMind’s breakthrough has the potential to accelerate drug discovery.

The shape of a protein is usually determined using some kind of imaging method. One of the most accurate is X-ray crystallography, in which a solution of proteins is crystallized and then bombarded with high-powered X-rays, and the resulting diffraction patterns are analyzed to build up a picture of the protein. But the method is expensive, time-consuming, and sometimes fraught. More recently, other methods have been used, such as flash-freezing the proteins at extremely low temperatures and then examining them in electron microscopes.

But back in 1972, Nobel laureate chemist Christian Anfinsen postulated that it should be possible to accurately predict the exact shape a protein will fold into just by looking at its amino-acid sequence, which is in turn encoded by its DNA. At the time, however, the computational methods, the gene-sequencing techniques, and, just as important, the computing power needed to work out such complex correlations did not exist.

A biennial contest for software that could accurately predict protein structure from genetic sequences, called the Critical Assessment of Protein Structure Prediction (or CASP) competition, began in 1994. In 2018, DeepMind, which is owned by Google parent company Alphabet, entered the competition for the first time using a deep-learning system, a kind of artificial intelligence that uses neural networks: software that is loosely based on the way connections in the human brain work. DeepMind’s system, which it called AlphaFold, handily beat all the other teams, making a big leap forward in prediction accuracy, although it was still far from equaling the accuracy of X-ray crystallography.

Last year, DeepMind entered again with a redesigned deep-learning system, AlphaFold 2. This time its predictions were so accurate across most protein types that not only did the A.I. company’s team win the contest, but the CASP organizers themselves declared that DeepMind had essentially solved the protein structure prediction problem as Anfinsen had first formulated it.

On July 16, in a peer-reviewed paper published in the prestigious scientific journal Nature, DeepMind offered further details of how exactly its A.I. software was able to perform so well. It has also open-sourced the code it used to create AlphaFold 2 for other researchers to use.

The company has said previously that it may develop an interface that would allow academic researchers and possibly even pharmaceutical companies to simply query AlphaFold 2 for protein structure predictions, but the company has not yet announced any such access. Having the source code would still require non-DeepMind scientists to train the neural network themselves before they could derive useful protein structure predictions.

“We pledged to share our methods and provide broad, free access to the scientific community,” Demis Hassabis, DeepMind’s cofounder and chief executive officer, said in a statement. “Today we take the first step toward delivering on that commitment.” Hassabis promised to share more updates “soon” on the company’s progress toward making AlphaFold 2’s predictions more widely available.

In its Nature paper, DeepMind wrote that AlphaFold 2 has already helped those who study X-ray crystallography and electron microscope images of proteins to better refine their understanding of what they are seeing in that data. The system has also already proven that it can accurately predict the shape of some key proteins associated with SARS-CoV-2, the virus that causes COVID-19.

The design of the neural network used in AlphaFold 2, according to the Nature paper, is complicated. It consists of two large modules that work together to create a prediction of a protein’s structure.

The first module, which DeepMind calls Evoformer, takes in both the protein’s raw genetic sequence and data about which parts of that sequence have co-evolved with those found in other proteins for which there is a known structure. The Evoformer then represents the data as a graph, in which the nodes are the protein’s amino acids and the edges represent the proximity of pairs of amino acids to one another in the protein. The Evoformer has 48 neural network “blocks,” each of which might consist of multiple layers of the network.
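To make the graph idea concrete, here is a minimal sketch that treats each amino acid as a node and adds an edge wherever two of them sit close together in space. It is only an illustration: the function name `contact_graph`, the 8-angstrom cutoff, and the toy coordinates are assumptions made for this example, and AlphaFold 2 learns its pairwise representation from sequence data rather than reading it off known coordinates.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Return {(i, j): distance} for every residue pair closer than `cutoff`.

    `coords` is a list of (x, y, z) positions, one per residue.
    The cutoff value is an illustrative assumption, not a figure from the paper.
    """
    edges = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            distance = math.dist(coords[i], coords[j])
            if distance < cutoff:
                edges[(i, j)] = distance
    return edges


# Hypothetical four-residue chain laid out along the x axis (angstroms).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
graph = contact_graph(toy_coords)
```

In this toy chain the first three residues end up connected to one another, while the fourth is too far away to share an edge with any of them.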

Each of these blocks performs a series of manipulations of this graph, using a variety of state-of-the-art machine-learning techniques, before passing its prediction along to the next block for further revision. In this way, the entire Evoformer gradually refines a forecast for what the backbone of the protein should look like. Some of the techniques the system uses are similar to those that underpin recent breakthroughs in natural language processing.
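The block-by-block refinement can be pictured as a loop in which every block receives the previous block's output and nudges it a little further. The sketch below shows only that control flow. The neighbour-averaging rule inside `make_block` is a stand-in invented for this example; the real blocks use learned, attention-based updates, and all names and the `noisy` input here are hypothetical.

```python
def make_block(step):
    """Build one toy 'block' that applies a small residual update."""
    def block(values):
        last = len(values) - 1
        updated = []
        for i, value in enumerate(values):
            left = values[max(i - 1, 0)]
            right = values[min(i + 1, last)]
            # Residual update: move part of the way toward the neighbours' mean.
            updated.append(value + step * ((left + right) / 2 - value))
        return updated
    return block


def run_stack(values, blocks):
    """Pass the representation through each block in turn."""
    for block in blocks:
        values = block(values)
    return values


# 48 blocks, matching the count the article gives for the Evoformer.
blocks = [make_block(0.5) for _ in range(48)]
noisy = [0.0, 4.0, 0.0, 4.0, 0.0, 4.0]
refined = run_stack(noisy, blocks)
```

Each pass leaves the estimate a little smoother than before, which is the sense in which a stack of blocks refines a prediction gradually rather than producing it in a single step.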

The Evoformer then passes its prediction to a second module, called the Structure Prediction Module. Consisting of eight more neural network blocks, it performs a series of geometric transformations to further refine the protein’s likely shape. In particular, this module builds up a picture of the protein’s likely “side chains,” which in abstracted 3D images of proteins appear as twisty, ribbonlike curlicues that branch off from the main protein backbone.
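A geometric transformation of the kind mentioned above can be as simple as rotating a piece of the chain and then shifting it, which moves the piece without distorting it. The sketch below is an assumption-laden illustration of that one idea, a rotation about the z axis followed by a translation; the function name and the two-point `segment` are hypothetical and are not taken from DeepMind's code.

```python
import math


def rotate_z_then_translate(point, angle_rad, shift):
    """Rotate `point` about the z axis by `angle_rad`, then add `shift`."""
    x, y, z = point
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return (
        cos_a * x - sin_a * y + shift[0],
        sin_a * x + cos_a * y + shift[1],
        z + shift[2],
    )


# Hypothetical two-point segment, rotated by 90 degrees and lifted one unit.
segment = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
moved = [rotate_z_then_translate(p, math.pi / 2, (0.0, 0.0, 1.0)) for p in segment]
```

Because the transform is rigid, the distance between the two points is the same before and after the move; only their position and orientation change.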

DeepMind noted in its paper that while AlphaFold 2 achieved accuracy to within a fraction of an atom’s width for a majority of known protein structures, there were still some areas where it struggled. For proteins for which fewer than 30 genetic sequences were known to have co-evolved across proteins, AlphaFold’s accuracy dropped substantially. DeepMind said it thought this co-evolution information was “needed to coarsely find the correct structure in the early stages of the network.”

The researchers also said the system did not perform as well for certain kinds of proteins whose shape is largely determined by interactions between the side chains rather than along the backbone, or for proteins made up of two very different, intertwined amino-acid chains. But the scientists also wrote that “we expect” the same ideas used in AlphaFold will be able to accurately predict such complex protein bindings in the future, hinting that perhaps DeepMind has already made progress on this problem behind the scenes.

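The intellectual property rights in content published by Fortune China are owned or held exclusively by Fortune Media IP Limited and/or the relevant rights holders. Reproduction, excerpting, copying, mirroring, or any other use without permission is prohibited.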
