The Limitations of Big Data

Clifton Leaf 2017-08-08
In places where there are tons of potentially useful data to examine, we don’t make it accessible in the ways people actually want to use it.

Chinese translation by 严匡正 (财富中文网)

“Every revolution in science—from the Copernican heliocentric model to the rise of statistical and quantum mechanics, from Darwin’s theory of evolution and natural selection to the theory of the gene—has been driven by one and only one thing: access to data.”

That was the eye-opening opening of a keynote address given yesterday by the brilliant John Quackenbush, a professor of biostatistics and computational biology at Dana-Farber Cancer Institute who has a dual professorship at the Harvard T.H. Chan School of Public Health and ample other academic credits after his name.

There is also no question that this digital fuel is driving virtually every transformation in healthcare happening today. Speaking at the MedCity Converge conference in Philadelphia, Quackenbush noted that the average hospital is generating roughly 665 terabytes of data annually, with some four-fifths of it in the unstructured forms of images, video, and doctor’s notes.

But the great limiting factor in harnessing all of this information-feedstock is not a “big data problem,” but rather a “messy data problem.”

In sum, in places where there are tons of potentially useful data to examine, we don’t make it accessible in the ways people actually want to use it. Either the data isn’t easy or intuitive to access or it simply isn’t informative. Or it’s in the wrong format. Or it’s incomplete, or created with incompatible “standards” (of which we seem to have an unlimited, irreconcilable supply). Or it captures just one dimension of a multidimensional realm. (“Biological systems are really complex, adaptive systems with many moving parts, that we’ve only begun to scratch the surface of understanding,” he says.)

Or—and this one seems to be a surprisingly common misstep—the data doesn’t really address the question the end user wants to answer. It’s off-purpose, in other words.

Take the case of population-level data, which government and academic institutions routinely collect: “Statistics operate on population data and medical research is driven by population data,” says Quackenbush, “but medical care is driven by individual-level data. So when we’re driving [our data research] to the clinic, we have to think about how we’re going to make that individual-level [data] available in a meaningful format.”

Ultimately, the goal, he says, should be to “create intuitive graphical representations of the underlying data” in ways that allow non-data scientists “to explore it without having to sit at a terminal and type in a bunch of obscure commands.”

“What you want to think about doing when you make data available to people is to create interfaces that allow them to dive in and make sense of that data, using their own intuition,” Quackenbush says.

Without doing that, all of our growing mounds of big data will simply be big blobs on ever-bigger data servers.

What’s to stop that from happening? The incentive for turning all this raw feedstock into a usable fuel “is not going to be enhancing healthcare or making people better,” Quackenbush says flatly. “The driver is really going to be the most important ‘–omics’ science of all: which is economics. We have to show that there’s an advantage to bringing this kind of data and information together if we’re really going to make advances.”