# Spark GraphX之全局聚类系数、局部聚类系数、网络平均聚类系数

2017/03/10 22:13

### Global clustering coefficient（全局聚类系数）

Another way to measure connectedness, the global clustering coefficient, is better in that it always returns a number between 0 and 1, making it possible to compare the connectedness of different sized graphs.

A triplet in this case is a set of three vertices that have two or three edges among them. If there are three edges, then it’s a triangle, and this is called a closed triplet. If there are only two edges, then it’s called an open triplet. Triplets are counted for each vertex and then added all together; this means that a triangle will count as three closed triplets, because each of the three vertices will have one closed triplet associated with it. ``````  def globalClusteringCoefficient[VD: ClassTag, ED: ClassTag](g:Graph[VD, ED]) = {
val numerator  = g.triangleCount().vertices.map(_._2).reduce(_ + _)
val denominator = g.inDegrees.map{ case (_, d) => d*(d-1) / 2.0 }.reduce(_ + _)
if(denominator == 0) 0.0 else numerator / denominator
}
``````

### local clustering coefficient（局部聚类系数）

that is the ratio of the actual triangle count at a vertex to the number of possible triangles at that vertex based on how many neighbors it has. For an undirected graph, the local clustering coefficient C for a vertex that has k neighbors and t triangles is: C = 2t / [ k * (k - 1) ]

``````  def localClusteringCoefficient[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]) = {
val triCountGraph = g.triangleCount()
val maxTrisGraph = g.inDegrees.mapValues(srcAttr => srcAttr*(srcAttr-1) / 2.0 )
triCountGraph.vertices.innerJoin(maxTrisGraph){ (vid, a, b) => if(b == 0) 0 else a / b }
}
``````

### network average clustering coefficient（网络平均聚类系数）

Computing the average value of the local clustering coefficient for all of the vertices in the graph gives us the network average clustering coefficient

``````    val vertices = sc.makeRDD(Seq((1L, ""), (1L, ""), (2L, ""), (3L, ""), (4L, ""), (4L, "")))
val edges = sc.makeRDD(Seq(Edge(1L, 1L, ""), Edge(1L, 2L, ""), Edge(2L, 3L, ""), Edge(1L, 3L, ""), Edge(3L, 4L, ""))).filter(e => e.srcId != e.dstId).flatMap(e => Array(e, Edge(e.dstId, e.srcId, e.attr))).distinct()  //去掉自循环边，有向图变为无向图，去除重复边
val testGraph =  Graph(vertices, edges).cache()
``````

``````    testGraph.vertices.foreach(println)
``````

``````(3,)
(1,)
(2,)
(4,)
``````

The vertices and edges arguments that we used to construct the Graph instance were regular RDDs—we didn’t even deduplicate the entries in the vertices so that there was only a single instance of each topic. Fortunately, the Graph API does this for us, converting the RDDs we passed in to a VertexRDD and an EdgeRDD, so that the vertex counts are now unique. Note that if there are duplicate entries in the EdgeRDD for a given pair of vertices, the Graph API will not deduplicate them: GraphX allows us to create multigraphs, which can have multiple edges with different values between the same pair of vertices. This can be useful in applications where the vertices in the graph represent rich objects, like people or businesses, that may have many different kinds of relationships between them (e.g., friends, family members, customers, partners, etc.). It also allows us to treat the edges as either directed or undirected, depending on the context.

``````    println(globalClusteringCoefficient2(testGraph))
``````

``````0.6
``````

``````    localClusteringCoefficient(testGraph).foreach(println)
``````

``````(3,0.3333333333333333)
(1,1.0)
(2,1.0)
(4,0.0)
``````

``````    println(localClusteringCoefficient(testGraph).map(_._2).sum() / testGraph.vertices.count())
``````

``````0.5833333333333333
``````

0
0 收藏

### 作者的其它热门文章 0 评论
0 收藏
0 