
Spark GraphX: Global Clustering Coefficient, Local Clustering Coefficient, and Network Average Clustering Coefficient

绝世武神
Published 2017/03/10 22:13

To measure the connectedness of a single graph, we can call triangleCount(), which GraphX supports natively. But what if we want to compare the connectedness of several graphs?

Global clustering coefficient

Another way to measure connectedness, the global clustering coefficient, is better in that it always returns a number between 0 and 1, making it possible to compare the connectedness of different sized graphs.

The global clustering coefficient is usually defined as: closed triplets / total triplets (open or closed)

A triplet in this case is a set of three vertices that have two or three edges among them. If there are three edges, then it’s a triangle, and this is called a closed triplet. If there are only two edges, then it’s called an open triplet. Triplets are counted for each vertex and then added all together; this means that a triangle will count as three closed triplets, because each of the three vertices will have one closed triplet associated with it.

[Figure: example graph in which vertices 1, 2, and 3 form a triangle, and vertex 4 is connected only to vertex 3]

As the figure shows, the triangle yields three closed triplets: one associated with vertex 1 (1-2-3), one with vertex 2 (2-3-1), and one with vertex 3 (3-1-2). Vertex 3 additionally has two open triplets, (1-3-4) and (2-3-4). The global clustering coefficient of this graph is therefore {(1-2-3), (2-3-1), (3-1-2)} / {(1-2-3), (2-3-1), (3-1-2), (1-3-4), (2-3-4)} = 3 / 5 = 0.6.
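To make that arithmetic concrete, here is a minimal, Spark-free Scala sketch that enumerates the triplets of the example graph by hand. The `adj` map is a hypothetical hard-coded adjacency list for illustration; it is not part of the original code.

```scala
// Adjacency of the example graph: 1, 2, 3 form a triangle,
// vertex 4 is attached only to vertex 3.
val adj = Map(
  1 -> Set(2, 3),
  2 -> Set(1, 3),
  3 -> Set(1, 2, 4),
  4 -> Set(3)
)

// Each unordered pair of neighbors of v is one triplet centered
// at v; the triplet is closed when the two neighbors are adjacent.
val triplets = for {
  (v, nbrs) <- adj.toSeq
  Seq(a, b) <- nbrs.toSeq.sorted.combinations(2)
} yield (v, a, b)

val closed = triplets.count { case (_, a, b) => adj(a).contains(b) }
val global = closed.toDouble / triplets.size
// closed = 3, triplets.size = 5, so global = 0.6
```

This reproduces the 3 / 5 = 0.6 result by brute force, which is handy for checking the Spark version on small graphs.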

The code to compute the global clustering coefficient is as follows:

  def globalClusteringCoefficient[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]) = {
    // triangleCount() counts each triangle once per member vertex, so the
    // sum over all vertices equals the number of closed triplets
    val numerator = g.triangleCount().vertices.map(_._2).reduce(_ + _)
    // with every edge stored in both directions, inDegree equals the
    // undirected degree k, and each vertex anchors k*(k-1)/2 triplets
    val denominator = g.inDegrees.map { case (_, d) => d * (d - 1) / 2.0 }.reduce(_ + _)
    if (denominator == 0) 0.0 else numerator / denominator
  }

Local clustering coefficient

Corresponding to the global clustering coefficient, there is also the notion of a local clustering coefficient: the ratio of the actual triangle count at a vertex to the number of possible triangles at that vertex, based on how many neighbors it has. For an undirected graph, the local clustering coefficient C of a vertex that has k neighbors and t triangles is: C = 2t / [ k * (k - 1) ]
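As a sanity check of the formula, the following Spark-free sketch evaluates C = 2t / [ k * (k - 1) ] directly on the example graph's adjacency list. The names `adj` and `localCC` are illustrative, not part of GraphX.

```scala
// Adjacency of the example graph from the figure.
val adj = Map(
  1 -> Set(2, 3),
  2 -> Set(1, 3),
  3 -> Set(1, 2, 4),
  4 -> Set(3)
)

def localCC(v: Int): Double = {
  val nbrs = adj(v)
  val k = nbrs.size
  if (k < 2) 0.0 // a vertex with fewer than 2 neighbors can close no triangle
  else {
    // t = number of edges among the neighbors of v,
    // i.e. the number of triangles through v
    val t = nbrs.toSeq.sorted.combinations(2)
      .count { case Seq(a, b) => adj(a).contains(b) }
    2.0 * t / (k * (k - 1))
  }
}
// localCC(1) = 1.0, localCC(3) = 1/3, localCC(4) = 0.0
```

Vertices 1 and 2 sit entirely inside the triangle, so their coefficient is 1.0; vertex 3 realizes only one of its three possible triangles, giving 1/3.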

The code to compute the local clustering coefficient is as follows:

  def localClusteringCoefficient[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]) = {
    // actual triangle count t at each vertex
    val triCountGraph = g.triangleCount()
    // maximum possible triangles at each vertex: k * (k - 1) / 2, using
    // inDegrees because every edge is stored in both directions
    val maxTrisGraph = g.inDegrees.mapValues(d => d * (d - 1) / 2.0)
    // C = t / (k * (k - 1) / 2) = 2t / (k * (k - 1)); 0 for degree < 2
    triCountGraph.vertices.innerJoin(maxTrisGraph) { (vid, t, max) =>
      if (max == 0) 0.0 else t / max
    }
  }

Network average clustering coefficient

Computing the average value of the local clustering coefficient for all of the vertices in the graph gives us the network average clustering coefficient.
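A minimal sketch of that average, plugging in the local coefficients worked out above for the example graph (the values are assumed from the earlier example, not computed here):

```scala
// local coefficients of vertices 1, 2, 3, 4 in the example graph
val localCCs = Seq(1.0, 1.0, 1.0 / 3, 0.0)
val networkAvg = localCCs.sum / localCCs.size
// (1 + 1 + 1/3 + 0) / 4 = 7/12 ≈ 0.5833
```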

The test code is as follows:

    val vertices = sc.makeRDD(Seq((1L, ""), (1L, ""), (2L, ""), (3L, ""), (4L, ""), (4L, "")))
    // drop self-loops, add the reverse of every edge to turn the directed
    // graph into an undirected one, then remove duplicate edges
    val edges = sc.makeRDD(Seq(Edge(1L, 1L, ""), Edge(1L, 2L, ""), Edge(2L, 3L, ""), Edge(1L, 3L, ""), Edge(3L, 4L, "")))
      .filter(e => e.srcId != e.dstId)
      .flatMap(e => Array(e, Edge(e.dstId, e.srcId, e.attr)))
      .distinct()
    val testGraph = Graph(vertices, edges).cache()

Print the vertices:

    testGraph.vertices.foreach(println)

The output is:

(3,)
(1,)
(2,)
(4,)

From this output it is easy to see that GraphX automatically removed the duplicate vertices for us.

The vertices and edges arguments that we used to construct the Graph instance were regular RDDs—we didn’t even deduplicate the entries in the vertices so that there was only a single instance of each topic. Fortunately, the Graph API does this for us, converting the RDDs we passed in to a VertexRDD and an EdgeRDD, so that the vertex counts are now unique. Note that if there are duplicate entries in the EdgeRDD for a given pair of vertices, the Graph API will not deduplicate them: GraphX allows us to create multigraphs, which can have multiple edges with different values between the same pair of vertices. This can be useful in applications where the vertices in the graph represent rich objects, like people or businesses, that may have many different kinds of relationships between them (e.g., friends, family members, customers, partners, etc.). It also allows us to treat the edges as either directed or undirected, depending on the context.

Compute the global clustering coefficient:

    println(globalClusteringCoefficient(testGraph))

The output is:

0.6

Compute the local clustering coefficient:

    localClusteringCoefficient(testGraph).foreach(println)

The output is:

(3,0.3333333333333333)
(1,1.0)
(2,1.0)
(4,0.0)

Compute the network average clustering coefficient:

    println(localClusteringCoefficient(testGraph).map(_._2).sum() / testGraph.vertices.count())

The output is:

0.5833333333333333

© All rights reserved by the author.