PNAS: A dataset analysis technique identifies patterns in fields ranging from bioinformatics to linguistics

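Automated analysis of large datasets can identify patterns in the data but cannot assess the significance of the patterns it finds, which can lead to meaningless results. Claudio Luchinat and colleagues developed a data analysis method that includes a cross-validation step in order to identify the most significant patterns. The method is called knowledge discovery by accuracy maximization (KODAMA).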

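An iterative process evaluates possible classifications of the data so as to classify as many data points as possible, and reduces the number of possible classes by merging similar ones. Finally, a dissimilarity matrix is defined to assess the relationships between data points. The authors applied KODAMA to several datasets, including lymphoma genetics, metabolomics, and the linguistic features of US State of the Union addresses dating back to 1900.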
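The general idea of the iterative step can be illustrated with a small sketch. This is not the authors' implementation: the leave-one-out 1-nearest-neighbour classifier, the random starting labels, the acceptance rule, and all function names and parameters (`maximize_accuracy`, `dissimilarity`, `n_runs`, `n_classes`, `n_iter`) are assumptions chosen for illustration. Misclassified points are randomly reassigned to their predicted class, a change is kept only if the cross-validated accuracy does not drop, and the dissimilarity of two points is taken as the fraction of repeated runs in which they end up in different classes.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_predict(points, labels):
    """Leave-one-out prediction: each point gets its nearest neighbour's label."""
    preds = []
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: sq_dist(p, points[k]))
        preds.append(labels[j])
    return preds

def accuracy(labels, preds):
    """Fraction of points whose predicted label matches the current label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def maximize_accuracy(points, n_classes, n_iter, rng):
    """Monte Carlo search for a labelling with high cross-validated accuracy."""
    labels = [rng.randrange(n_classes) for _ in points]  # assumed random start
    best = accuracy(labels, loo_predict(points, labels))
    for _ in range(n_iter):
        preds = loo_predict(points, labels)
        wrong = [i for i, (l, p) in enumerate(zip(labels, preds)) if l != p]
        if not wrong:
            break  # every point is predicted correctly
        candidate = labels[:]
        # Move a random subset of misclassified points to their predicted class.
        for i in rng.sample(wrong, rng.randint(1, len(wrong))):
            candidate[i] = preds[i]
        score = accuracy(candidate, loo_predict(points, candidate))
        if score >= best:  # keep the change only if accuracy does not drop
            labels, best = candidate, score
    return labels, best

def dissimilarity(points, n_runs=20, n_classes=4, n_iter=50, seed=0):
    """Fraction of runs in which each pair of points lands in different classes."""
    rng = random.Random(seed)
    n = len(points)
    apart = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        labels, _ = maximize_accuracy(points, n_classes, n_iter, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] != labels[j]:
                    apart[i][j] += 1
    return [[count / n_runs for count in row] for row in apart]

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
D = dissimilarity(points)
for row in D:
    print(" ".join(f"{v:.2f}" for v in row))
```

Because relabelling moves points into classes that already exist, the number of distinct classes can only stay the same or shrink during a run, which loosely mirrors the merging of similar classes described above. The resulting matrix is symmetric with a zero diagonal and could be passed to any clustering or embedding method that accepts precomputed dissimilarities.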

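For the State of the Union addresses, the authors report that KODAMA revealed a shift during President Ronald Reagan's term: the frequency of words such as "labor," "production," and "spending" decreased, while the frequency of words such as "parents," "children," and "reform" increased. The authors say these results suggest that KODAMA may be able to extract meaningful patterns from noisy or complex datasets.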

Original abstract:

Knowledge discovery by accuracy maximization

Stefano Cacciatore, Claudio Luchinat and Leonardo Tenori

Here we describe KODAMA (knowledge discovery by accuracy maximization), an unsupervised and semisupervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, the peculiarity of KODAMA is that it is driven by an integrated procedure of cross-validation of the results. The discovery of a local manifold’s topology is led by a classifier through a Monte Carlo procedure of maximization of cross-validated predictive accuracy. Briefly, our approach differs from previous methods in that it has an integrated procedure of validation of the results. In this way, the method ensures the highest robustness of the obtained solution. This robustness is demonstrated on experimental datasets of gene expression and metabolomics, where KODAMA compares favorably with other existing feature extraction methods. KODAMA is then applied to an astronomical dataset, revealing unexpected features. Interesting and not easily predictable features are also found in the analysis of the State of the Union speeches by American presidents: KODAMA reveals an abrupt linguistic transition sharply separating all post-Reagan from all pre-Reagan speeches. The transition occurs during Reagan’s presidency and not from its beginning.

Tags: bioinformatics, dataset analysis techniques, University of Florence

Author: 生物帮
