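Industry Research Report Bibliography
Information Transmission, Software and Information Technology Services (2015, Issue 9)
(Reports processed: 2015-06-01 to 2015-06-10)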

Industry News

Overseas Analysis Reports

Chinese-Language Technical Reports

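  • A density-based cluster-structure mining algorithm for massive data streams
    This paper proposes a density-based cluster-structure mining algorithm (mining density-based clustering structure over data streams, MCluStream for short) to address two problems in density-based clustering of data streams: the difficulty of choosing input parameters and the identification of overlapping clusters. First, a tree-topology index structure, the CR-Tree, is designed. Each pair of directly core-reachable data points is mapped to a parent-child relationship in the tree, so the CR-Tree, which encodes the dependencies among data points, covers the density-based cluster structures under a range of subEps parameter values. Second, MCluStream updates the CR-Tree with a sliding window and maintains the cluster structure of the current window online, which enables fast evolutionary clustering analysis of massive data streams. Third, a method is designed for quickly extracting cluster structures from the CR-Tree, so that a reasonable clustering result can be chosen from the visualized cluster structure. Finally, experiments on real and synthetic massive data sets show that MCluStream mines clusters effectively, clusters efficiently, and has a small space overhead. MCluStream is suited to adaptive, evolutionary density-based clustering analysis in massive data stream applications.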
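  • An unlabeled information algebra model and algorithm based on soft sets
    Extension and transport operations are introduced on the set of all soft sets over a given initial universe U and parameter set E, and their properties are studied. On this basis, the notion of a quotient soft set is introduced, and combination and focusing operations are defined on the set of all quotient soft sets, which is shown to form an unlabeled (domain-free) information algebra. If the parameter set E is finite, this algebra is moreover a compact unlabeled information algebra. Finally, a decision algorithm that uses the unlabeled information algebra model to solve uncertainty problems in soft sets is given together with a worked example, and it is compared with the uni-int decision algorithm proposed by Cagman et al.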
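
For readers unfamiliar with the uni-int decision algorithm named in the soft-set entry above, the following is a minimal Python sketch. It is not taken from the report. It assumes the commonly cited definition: two soft sets are combined with an AND-product, and the decision set is the union of the "uni_x int_y" and "uni_y int_x" sets. The function names (and_product, uni_int) and the example data are hypothetical.

```python
from itertools import product


def and_product(f_a, f_b):
    """AND-product: map each parameter pair (x, y) to f_a[x] & f_b[y]."""
    return {(x, y): f_a[x] & f_b[y] for x, y in product(f_a, f_b)}


def uni_int(f_a, f_b):
    """uni-int decision set of two non-empty soft sets (assumed definition).

    A soft set is modelled as a dict from parameter name to a subset of U.
    """
    prod = and_product(f_a, f_b)
    # For each x, intersect over all y; then take the union over x.
    uni_x_int_y = set().union(
        *(set.intersection(*(prod[x, y] for y in f_b)) for x in f_a)
    )
    # Symmetric case: for each y, intersect over all x; then union over y.
    uni_y_int_x = set().union(
        *(set.intersection(*(prod[x, y] for x in f_a)) for y in f_b)
    )
    return uni_x_int_y | uni_y_int_x


# Hypothetical example: candidates u1..u5 judged by two decision makers.
f_a = {"cheap": {"u1", "u2", "u3"}, "modern": {"u2", "u3", "u4"}}
f_b = {"large": {"u2", "u3", "u5"}, "quiet": {"u1", "u3"}}
decision = uni_int(f_a, f_b)
print(sorted(decision))
```

With this example data the decision set is {u2, u3}. The report's own algorithm, which is built on the unlabeled information algebra, is not reproduced here.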

Foreign-Language Technical Reports

  • An architecture design for fast, general-purpose data processing on large clusters
    This dissertation proposes an architecture for cluster computing systems that can tackle emerging data processing workloads while coping with larger and larger scales. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping the scalability and fault tolerance of previous systems. And whereas most deployed systems only support simple one-pass computations (e.g. aggregation or SQL queries), ours also extends to the multi-pass algorithms required for more complex analytics (e.g. iterative algorithms for machine learning). Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing, or SQL and complex analytics.
  • Techniques for optimizing the performance of parallel jobs in data-intensive clusters
    A simple but key aspect of parallel jobs is the all-or-nothing property: unless all tasks of a job are provided equal improvement, there is no speedup in the completion of the job. The all-or-nothing property is critical for the promise of efficient and fault-tolerant parallel computations on large clusters. Meeting this promise in clusters of these scales is challenging and a key departure from prior work on distributed systems. This talk will look at the execution of a job from first principles and propose techniques spanning the software stack of data analytics systems such that its tasks achieve homogeneous performance while overcoming the various heterogeneities. To that end, we will propose techniques for (i) caching and cache replacement for parallel jobs, which outperform even Belady's MIN (which uses an oracle), (ii) data locality, and (iii) straggler mitigation. Our analyses and evaluation are performed using workloads from Facebook and Bing production datacenters. Along the way, we will also describe how we broke the myth of disk-locality's importance in datacenter computing.
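
The all-or-nothing property described in the entry above can be illustrated with a small Python sketch. It is a simplified model and not the authors' system. It assumes that all tasks of a job run in parallel in a single wave, so the job finishes when its slowest task finishes. The function names, task durations, and the speedup factor are hypothetical.

```python
def job_completion_time(task_times):
    """A single-wave parallel job finishes when its slowest task finishes."""
    return max(task_times)


def apply_cache(task_times, cached_tasks, speedup):
    """Return task times where only the tasks with cached input run faster."""
    return [
        t / speedup if i in cached_tasks else t
        for i, t in enumerate(task_times)
    ]


# Hypothetical job: four equal tasks of 10 s each, 5x faster on cached input.
base = [10.0, 10.0, 10.0, 10.0]
partial = apply_cache(base, cached_tasks={0, 1, 2}, speedup=5.0)
full = apply_cache(base, cached_tasks={0, 1, 2, 3}, speedup=5.0)

print(job_completion_time(base))     # no caching
print(job_completion_time(partial))  # three of four inputs cached
print(job_completion_time(full))     # all inputs cached
```

In this model, caching three of the four inputs leaves the job at 10 s because the uncached task still dominates. Only caching every input brings the job down to 2 s, which is why the abstract argues that cache replacement for parallel jobs should reason about whole jobs and not individual blocks.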

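If the report you need is not listed here, you can search for it or request a custom report in the Industry Research Report Database (http://hybg.hbsts.org.cn).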

If you run into any problems while using this service, please let us know promptly.