This page reproduces the content of http://www.slideshare.net/ChaToX/finding-dense-subgraphs

Uploaded 2015/12/22 in Technology

Part of the course "Algorithmic Methods of Data Science". Sapienza University of Rome, 2015.


Graph partitioning I: Dense Sub-Graphs

Class: Algorithmic Methods of Data Mining

Program: M. Sc. Data Science

University: Sapienza University of Rome

Semester: Fall 2015

Lecturer: Carlos Castillo http://chato.cl/

Sources:

● Tutorial by A. Beutel, L. Akoglu, C. Faloutsos [Link]

● Frieze, Gionis, Tsourakakis: “Algorithmic techniques for modeling and mining large graphs (AMAzING)” [Tutorial]

● A survey of algorithms for dense sub-graph discovery [link]

Sub-graphs

2 - Subgraph

Subset of nodes, and edges among those nodes

3 - Ego network

x

Ego graph of node x = neighbors and the links between them

4 - Typical pattern

Oddball: Spotting anomalies in weighted graphs. Leman Akoglu, Mary McGlohon, Christos Faloutsos. PAKDD 2010

5 - k-core decomposition

6 - k-core decomposition

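● Repeatedly remove all nodes having degree 1 (removing a node can lower the degree of its neighbors)

– Those nodes have core number 1: they are in the 1-core but not in the 2-core

● Repeatedly remove all nodes having degree at most 2 in the remaining graph

– Those nodes have core number 2

● Repeatedly remove all nodes having degree at most 3 in the remaining graph

– Those nodes have core number 3

● Etc.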
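The peeling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `core_numbers`, the adjacency-dictionary input format, and the example graph are all assumptions. It removes one minimum-degree node at a time, which yields the same core numbers as removing whole degree classes.

```python
def core_numbers(adj):
    """Return {node: core number} for an undirected graph {node: set(neighbors)}."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest current degree.
        v = min(remaining, key=degree.get)
        # The peeling level never decreases, even if degrees drop later.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Hypothetical example: a triangle (1, 2, 3) with a pendant node 4 attached to 3.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cores = core_numbers(example)
print(cores)
```

Picking the minimum by a linear scan keeps the sketch short but costs quadratic time; bucket-based implementations run in time linear in the number of edges.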

7 - Try it!

How many nodes are there in each core of this graph?

http://www.cpt.univ-mrs.fr/~barrat/NHM.pdf

9 - Graph s-t cuts

10 - Min s-t cut

Given a weighted graph G(V,E), W:E→R

An (s-t)-cut C=(S,T) is such that

– S ∪ T=V, S ∩ T=∅

– s ∈ S, t ∈ T

The cost of a cut is the total weight of the edges that go from S to T: $\sum_{i \in S,\, j \in T} W(i,j)$

Key problem: given G, s, t, find min weight s-t cut
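To make the definition concrete, here is a small sketch that computes the cost of a cut and finds the minimum s-t cut by brute force. It assumes an undirected graph stored as a dictionary from node pairs to weights; the helper names and the example weights are hypothetical, and the exhaustive search is only meant for toy graphs.

```python
from itertools import combinations


def cut_cost(weights, S):
    """Total weight of edges with exactly one endpoint in S (undirected graph)."""
    S = set(S)
    return sum(w for (u, v), w in weights.items() if (u in S) != (v in S))


def brute_force_min_st_cut(weights, s, t):
    """Try every set S with s in S and t not in S; exponential, toy graphs only."""
    others = sorted({u for edge in weights for u in edge} - {s, t})
    best_cost, best_side = None, None
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            side = {s, *extra}
            cost = cut_cost(weights, side)
            if best_cost is None or cost < best_cost:
                best_cost, best_side = cost, side
    return best_cost, best_side


# Hypothetical weighted graph, not the one drawn on the slides.
weights = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 1, ("a", "t"): 4, ("b", "t"): 2}
best_cost, best_side = brute_force_min_st_cut(weights, "s", "t")
print(best_cost, best_side)
```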

11 - Example of two s-t cuts

(Figure: graph with nodes s and t and two candidate cuts, one red and one green)

If all edge weights are equal, which one is a smaller cut, the red or the green? Is this the smaller cut in this case?

12 - Example of two s-t cuts

(Figure: graph with nodes s and t and edge weights 2, 5, 2, 4, 1, 2 and 4)

What about now, what is the smaller s-t cut in the graph?

13 - What defines an s-t cut?

● Can I take an arbitrary set of edges and claim it is an s-t cut?

● Is this an s-t cut? Why? Why not?

(Figure: graph with nodes s and t and a highlighted set of edges)


17 - Simple s-t paths and s-t cuts

● For a subset of edges C of a graph to be an s-t cut, every simple path between s and t must contain at least one edge in C

18 - Maximum flows

19 - Maximum flow: example 1

● If edge weights were capacities, what is the maximum flow that can be sent from s to t?

(Figure: graph from s to t with edge capacities 5 m³/second, 1 m³/second and 3 m³/second)

20 - Maximum flow: example 2

● If edge weights were capacities, what is the maximum flow that can be sent from s to t?

(Figure: graph from s to t with edge capacities 1, 5, 4, 4, 2 and 6)

21 - Maximum flow problem

● What is the maximum “flow” that can be carried from s to t?

– Think of edge weights as capacities (e.g. m³/s of water)

● What is the flow of an edge?

– The amount sent through that edge (an assignment)

● What is the net flow of a node?

– The amount exiting the node minus the amount entering the node

22 - Formulating the max flow problem

● The flow through each edge should be between 0 and the capacity of that edge

● Net flow node h = OUT(h) – IN(h)

● Node s should have positive flow v

● Node t should have negative flow -v

● What should be the flow of the other nodes?

23 - Formulating the max flow problem

● Net flow node h = OUT(h) - IN(h)

● Node s should have positive flow v

● Node t should have negative flow -v

h

● What should be the flow of a node?
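The formulas on these slides did not survive extraction. One standard way to write the problem as a linear program is sketched below; the symbols $f_{ij}$ (flow on edge (i,j)) and $c_{ij}$ (its capacity) are assumptions, and the slides may have used a different but equivalent notation. Every node other than s and t has net flow zero.

```latex
\begin{aligned}
\max_{f,\,v}\quad & v \\
\text{s.t.}\quad
& \mathrm{OUT}(h) - \mathrm{IN}(h) =
  \begin{cases}
    v  & \text{if } h = s \\
    -v & \text{if } h = t \\
    0  & \text{otherwise}
  \end{cases}
  && \forall h \in V \\
& 0 \le f_{ij} \le c_{ij} && \forall (i,j) \in E \\[4pt]
\text{where}\quad
& \mathrm{OUT}(h) = \sum_{j:(h,j) \in E} f_{hj},
  \qquad
  \mathrm{IN}(h) = \sum_{i:(i,h) \in E} f_{ih}
\end{aligned}
```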

24 - Writing the dual

26 - Writing the dual

● Remember: the infimum of the dual objective equals the supremum of the primal objective

● Variables $u_i$ don't enter the objective, only their differences appear in the constraints

● We can set them arbitrarily, in particular $u_s = 0$, $u_t = 1$
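The dual itself is missing from the extracted slides. Assuming the primal is "maximize v subject to flow conservation at every node and $0 \le f_{ij} \le c_{ij}$ on every edge", with one free variable $u_h$ per conservation constraint and one variable $y_{ij} \ge 0$ per capacity constraint, a dual consistent with the remarks above can be written as follows. The symbol $c_{ij}$ and the sign convention are assumptions.

```latex
\begin{aligned}
\min_{u,\,y}\quad & \sum_{(i,j) \in E} c_{ij}\, y_{ij} \\
\text{s.t.}\quad
& y_{ij} \ge u_j - u_i && \forall (i,j) \in E \\
& u_t - u_s = 1 \\
& y_{ij} \ge 0 && \forall (i,j) \in E
\end{aligned}
```

Only differences of the $u$ variables appear, so fixing $u_s = 0$ forces $u_t = 1$. Setting $u_i = 0$ on one side of a cut, $u_i = 1$ on the other side, and $y_{ij} = 1$ exactly on the edges that cross from the first side to the second gives an objective value equal to the cost of that cut.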

27 - Dual (after simplification)

● Observe what happens with the values of u in every path going from s to t

(Figure: a path from s to t whose nodes are labeled u=0, u=?, ..., u=?, u=?, u=1)

28 - Dual (after simplification)

● Given these constraints, the sequence must increase, and can only increase once

(Figure: a path from s to t whose nodes are labeled u=0, u=0, ..., u=1, u=1, u=1)

29 - Dual (after simplification)

● Important theorem: every feasible

solution can be written as a cut (S, S')

30 - Dual solutions are cuts

● Every feasible solution of the dual has the form of a cut (S, S')

(Figure: a cut (S, S') with s in S, where $u_s = 0$, and t in S', where $u_t = 1$)

31 - Dual solutions are cuts

● Every feasible solution of the dual has the form of a cut (S, S')

(Figure: a cut (S, S') with $u_s = 0$ and $u_t = 1$; every node i in S has $u_i = 0$ and every node i in S' has $u_i = 1$)

32 - Dual solutions are (s-t)-cuts

and remember we're trying to minimize

(Figure: a cut (S, S') with $u_s = 0$ and $u_t = 1$; every node i in S has $u_i = 0$ and every node i in S' has $u_i = 1$)

33 - One more thing about the solution

$y_{ij}$ is a dual variable corresponding to a primal constraint

If $y_{ij}$ is non-zero, then the corresponding constraint is tight

34 - Summarizing

● Min-cut and Max-flow are equivalent problems

– Their solutions are also equal: the value of the maximum flow is equal to the weight of the minimum cut

● Think of a chain that breaks at the weakest link

● Both can be solved exactly in polynomial time
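The equality between the maximum flow and the minimum cut can be checked on a small example. The sketch below uses the Edmonds-Karp method (shortest augmenting paths found by breadth-first search), which is one polynomial-time algorithm among several and is not necessarily the one used in the course. The function name, the nested-dictionary input format, and the example capacities are assumptions.

```python
from collections import deque


def max_flow_min_cut(capacity, s, t):
    """Return (max flow value, source side of a min cut) for a directed graph."""
    # Residual capacities, with a zero-capacity reverse arc for every arc.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No path left: the nodes still reachable from s form the set S.
            return flow, set(parent)
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push


# Hypothetical capacities, not the graph drawn on the slides.
capacity = {"s": {"a": 4, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}, "t": {}}
value, side = max_flow_min_cut(capacity, "s", "t")
cut_weight = sum(c for u in side for v, c in capacity[u].items() if v not in side)
print(value, side, cut_weight)
```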

35 - A simple randomized algorithm

● Pick an edge at random (u,v)

● Merge u and v in new vertex uv

● Edges between u and v are removed

● Edges pointing to u or v are added as multi-edges to vertex uv

● When only s and t remain, the multi-edges are a cut, probably the minimum one
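A single run of this contraction procedure, and the repetition needed to make a good result likely, might look as follows. This is a sketch under assumptions: edges are unweighted and stored as a list of pairs, merged vertices are tracked with a union-find structure, and an edge that would merge s with t is never picked. The function name and the example graph are hypothetical.

```python
import random


def contract_once(edges, s, t, rng):
    """One randomized contraction run; returns the size of the s-t cut it finds."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    groups = len({find(x) for e in edges for x in e} | {find(s), find(t)})
    while groups > 2:
        terminals = {find(s), find(t)}
        # Drop self-loops and any edge that would merge s with t.
        candidates = [(u, v) for u, v in edges
                      if find(u) != find(v) and {find(u), find(v)} != terminals]
        if not candidates:
            break
        u, v = rng.choice(candidates)
        parent[find(u)] = find(v)  # merge u and v into one super-vertex
        groups -= 1
    # The surviving multi-edges between different super-vertices form the cut.
    return sum(1 for u, v in edges if find(u) != find(v))


# Hypothetical graph: two triangles joined by the single edge (b, c).
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("b", "c"),
         ("c", "d"), ("c", "t"), ("d", "t")]
rng = random.Random(0)
results = [contract_once(edges, "s", "t", rng) for _ in range(200)]
print(min(results))
```

Every run returns a valid s-t cut, so taking the minimum over many runs can only improve the answer.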

http://www.cs.berkeley.edu/~jfc/cs174lecs/lec18/lec18.html

36 - Example run

(Figure: successive edge contractions on a graph with terminals s and t)

Cut of weight 2 found!

38 - Randomized algorithm might miss the min cut

● Multiple runs are required

● The probability that this finds the min cut in one run is about $1/\log n$, so $O(\log n)$ iterations are required to find min cut

● Each iteration costs $O(n^2 \log n)$

● $O(n^2 \log^2 n)$ operations needed to find min cut

● Exact algorithm: $O(n^3 + n^2 \log n)$; the $n^3$ is because of |V||E| operations required

http://www.cs.berkeley.edu/~jfc/cs174lecs/lec18/lec18.html

39 - Densest sub-graph

40 - Density measures

● Density = Average degree = 2|E|/|V|

– Sometimes just |E|/|V|

● Edge ratio = (2|E|)/(|V|(|V|-1))

– What is |V|(|V|-1)/2?
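Both measures are one-line computations. The sketch below uses hypothetical function names and a 5-node example; it also illustrates that |V|(|V|-1)/2 is the number of edges of a clique on |V| nodes, so the edge ratio of a clique is 1.

```python
def average_degree(num_nodes, num_edges):
    """Density as average degree: 2|E| / |V|."""
    return 2 * num_edges / num_nodes


def edge_ratio(num_nodes, num_edges):
    """Fraction of all possible edges that are present: 2|E| / (|V| (|V| - 1))."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# A clique on 5 nodes has 5 * 4 / 2 = 10 edges.
print(average_degree(5, 10), edge_ratio(5, 10))
# The same 5 nodes with only 8 edges.
print(average_degree(5, 8), edge_ratio(5, 8))
```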

41 - Densest sub-graph

42 - Goldberg's algorithm for densest subgraph

● Requires: min-cut problem

Slides on this section from: http://www.math.cmu.edu/~ctsourak/amazing.html

43 - Goldberg's algorithm (1)

44 - Goldberg's algorithm (2)

45 - Goldberg's algorithm (3)

46 - Goldberg's algorithm (4)

47 - Goldberg's algorithm (5)

If this exists for non-empty S, then S is a sub-graph of density c

48 - Goldberg's algorithm (6)

● to find the densest subgraph perform binary search on c

– logarithmic number of min-cut calls

– each min-cut call requires O(|V||E|) time

● problem can also be solved with one min-cut call using the parametric max-flow algorithm
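Since the construction slides are images, the sketch below follows the network usually attributed to Goldberg rather than the slides themselves: a source connected to every node with capacity |E|, every node connected to a sink with capacity |E| + 2c - deg(v), and capacity 1 in both directions on each original edge. Density is |E|/|V|. The min-cut routine is a plain Edmonds-Karp, exact fractions avoid rounding issues, and all names are assumptions.

```python
from collections import deque
from fractions import Fraction

SOURCE, SINK = ("source",), ("sink",)  # sentinels that cannot clash with node names


def source_side_of_min_cut(capacity, s, t):
    """Edmonds-Karp; returns the nodes reachable from s once the flow is maximum."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push


def densest_subgraph(nodes, edges):
    """Binary search on the density guess c, one min-cut call per guess."""
    n, m = len(nodes), len(edges)
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    low, high, best = Fraction(0), Fraction(m), set(nodes)
    # Two different densities differ by at least 1 / (n (n - 1)).
    while high - low >= Fraction(1, n * (n - 1)):
        c = (low + high) / 2
        capacity = {SOURCE: {v: m for v in nodes}, SINK: {}}
        for v in nodes:
            capacity[v] = {SINK: m + 2 * c - degree[v]}
        for u, v in edges:
            capacity[u][v] = 1
            capacity[v][u] = 1
        side = source_side_of_min_cut(capacity, SOURCE, SINK) - {SOURCE}
        if side:          # a sub-graph denser than c exists
            low, best = c, side
        else:             # no sub-graph is denser than c
            high = c
    return best


# Hypothetical graph: a 4-clique (0..3) with a path 0 - 4 - 5 attached.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (4, 5)]
print(densest_subgraph(nodes, edges))
```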

49 - A faster algorithm

● Charikar, M. (2000). Greedy approximation algorithms for finding dense components in a graph. In APPROX.

● Approximate algorithm (by a factor of 2)

50 - Greedy algorithm
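The algorithm on this slide is an image in the original deck. A common reading of Charikar's greedy method is: repeatedly delete a node of minimum degree and remember the densest intermediate sub-graph. The sketch below follows that reading, with density taken as |E|/|V|; the function name, the input format, and the example graph are assumptions.

```python
def greedy_densest_subgraph(adj):
    """Peel minimum-degree nodes; return (best node set, best density |E|/|V|)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a private copy
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes, best_density = set(adj), num_edges / len(adj)
    while len(adj) > 1:
        v = min(adj, key=lambda x: len(adj[x]))  # a node of minimum degree
        num_edges -= len(adj[v])
        for u in adj.pop(v):
            adj[u].discard(v)
        density = num_edges / len(adj)
        if density > best_density:
            best_nodes, best_density = set(adj), density
    return best_nodes, best_density


# Hypothetical graph: a 4-clique (0..3) with a path 0 - 4 - 5 attached.
graph = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3},
         3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
best_nodes, best_density = greedy_densest_subgraph(graph)
print(best_nodes, best_density)
```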

51 - Example run of Greedy Algorithm

Density computed as |E|/|V|:

13/11=1.18, 14/10=1.40, 13/9=1.44, 12/8=1.50, 11/7=1.57, 9/6=1.50, 8/5=1.60, 6/4=1.50, 3/3=1.00, 1/2=0.50, 0/1=0.00

Done!

54 - Approximation guarantee

● S* = optimal sub-graph (highest density)

● density(S*) = λ = |e(S*)| / |S*|

● For all v in S*, deg(v) >= λ, because removing v from S* cannot increase the density (optimality of S*)

https://people.cs.umass.edu/~barna/paper/dense-subgraph-full.pdf

55 - Approximation guarantee (cont)

Hence,
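The formula that followed on this slide is missing from the extraction. A reconstruction from the definitions of S* and λ is sketched below, assuming deg(v) denotes the degree of v inside S* and that S* has more than one node. Removing v from S* deletes exactly deg(v) edges and cannot yield a higher density than the optimum:

```latex
\begin{aligned}
\frac{|e(S^*)| - \deg(v)}{|S^*| - 1} &\le \frac{|e(S^*)|}{|S^*|} \\
|S^*|\,\bigl(|e(S^*)| - \deg(v)\bigr) &\le \bigl(|S^*| - 1\bigr)\,|e(S^*)| \\
-\,|S^*|\,\deg(v) &\le -\,|e(S^*)| \\
\deg(v) &\ge \frac{|e(S^*)|}{|S^*|} = \lambda
\end{aligned}
```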

56 - Approximation guarantee (cont.)

● Now, let's consider when greedy removes the first vertex v of the optimal solution

● At that point, all the vertices of the remaining subgraph (S) have degree >= λ, because v has degree >= λ and greedy always removes a vertex of minimum degree

● Hence, this subgraph has at least λ|S|/2 edges, and density at least λ/2

● Hence this is a 2-approximate algorithm

57 - Bi-partite near cliques

58 - Dense subgraphs in matrix representation of a graph

Re-arrange rows and columns

59 - Dense subgraphs in matrix representation of a graph

Similar to a bi-partite clique

Re-arrange rows and columns

60 - Example of bi-partite near-cliques

Fans and artists in cultural products also create bi-partite near-cliques.

61 - Scalable method for dense sub-graphs

● D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proc. 31st Intl. Conf. on Very Large Data Bases, pages 721–732. ACM, 2005.

● Can be applied to arbitrarily large graphs

62 - Shingling algorithm

● Take a permutation π and apply it to both sets (A and B)

● Take the minimum element in each set under this permutation

● The probability of the two minima matching is the Jaccard coefficient of A and B

A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157–1166, 1997.

63 - Example

● A = {dcab, abcd, cabb, aabd}

● B = {abcd, dabc, abbd, badd, dcab}

● Suppose permutation = “sort by second character, then by fourth”

– Minimum(A) = cabb

– Minimum(B) = dabc

– Bad luck this time, however …

● If you use many permutations, you can get good estimates of Jaccard coefficient
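The example can be replayed in code, together with the many-permutations estimate. The sets A and B and the "second character, then fourth" ordering come from the slide; the function names, the seeded random shuffles used as permutations, and the number of permutations are assumptions made for illustration.

```python
import random

A = {"dcab", "abcd", "cabb", "aabd"}
B = {"abcd", "dabc", "abbd", "badd", "dcab"}


def second_then_fourth(word):
    """Sort key from the slide: second character first, then fourth character."""
    return (word[1], word[3])


min_a = min(A, key=second_then_fourth)
min_b = min(B, key=second_then_fourth)


def jaccard(a, b):
    """Exact Jaccard coefficient |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)


def estimate_jaccard(a, b, num_permutations, seed=0):
    """Fraction of random permutations under which both sets share their minimum."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    matches = 0
    for _ in range(num_permutations):
        order = universe[:]
        rng.shuffle(order)  # one random permutation of the universe
        rank = {item: position for position, item in enumerate(order)}
        if min(a, key=rank.get) == min(b, key=rank.get):
            matches += 1
    return matches / num_permutations


print(min_a, min_b)
print(jaccard(A, B), estimate_jaccard(A, B, 2000))
```

The exact coefficient here is 2/7, since A and B share two of the seven distinct strings.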

64 - How to build the permutations

● What is a natural family of permutations to use?

65 - Yes

● That's why this method is often referred to as min-hashing

66 - Advantages of shingling

● A and B can be huge

– but the shingle vector is of fixed size!

– comparisons of shingles are much faster

● How to apply this to finding dense sub-graphs?

– We are going to use procedure shingle(list), which computes a shingle of size c of a list

67 - Algorithm

● Let e(v) be the edges of v

● Start with lists <v, e(v)>

● Compute <v, shingles(e(v))>

● Invert this list to obtain <shingle, list of v> = S1

● Cluster this list, how?

– Compute <shingle, shingles(list of v)> = S2

– Cluster S2 using any clustering method

● Output = list of shingles, and list of vertices sharing those shingles
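The first half of this pipeline (up to the inverted list S1) can be sketched as follows. This is a simplified illustration and not the procedure of Gibson et al.: each shingle here is a single min-hash element under one of c salted hash functions, the second-level shingling and the clustering step are left out, and every name in the code is hypothetical.

```python
import hashlib
from collections import defaultdict


def salted_hash(salt, item):
    """Deterministic hash of an item under the hash function number `salt`."""
    return hashlib.md5(f"{salt}:{item}".encode()).hexdigest()


def shingles(items, c):
    """c shingles of a list: the min-hash element under each of c hash functions."""
    return [(salt, min(items, key=lambda x: salted_hash(salt, x)))
            for salt in range(c)]


def invert(adj, c=4):
    """Build S1 = {shingle: set of vertices whose link list produced it}."""
    s1 = defaultdict(set)
    for v, links in adj.items():
        if links:
            for shingle in shingles(sorted(links), c):
                s1[shingle].add(v)
    return s1


# Hypothetical link lists: a, b and c all point to x, y, z; d points elsewhere.
adj = {"a": {"x", "y", "z"}, "b": {"x", "y", "z"},
       "c": {"x", "y", "z"}, "d": {"p", "q"}}
s1 = invert(adj)
groups = {frozenset(vs) for vs in s1.values() if len(vs) >= 2}
print(groups)
```

Vertices with identical link lists always share all of their shingles, and vertices with similar lists share many of them, which is what the later clustering step exploits.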
