
Community Detection

Evolutionary Algorithms vs K-Means

Report by Andrei Popovici, Alexandra Lakatos


I. Introduction

The scientific problem that we tried to solve is known in the literature as "Community Detection". It has various real-life use cases, such as detecting dangerous communities in social networks or targeting ads to specific groups of people. More specifically, we tackled the non-overlapping version of the problem, in which every node belongs to exactly one community.

We ran experiments with two popular approaches, described in the next chapter:

  • Evolutionary algorithm
  • K-Means clustering

II. Our methods

K-Means Clustering

The K-Means clustering algorithm aims to partition n observations into k clusters. It is usually applied to multi-dimensional vectors, but for this project we wrote a discrete version of K-Means that works directly on graphs.

A possible pseudocode for this algorithm is:

  1. initialize random centroids
  2. assign each node to the closest centroid
  3. for no_generations:
     3.1. compute the new centroids
     3.2. assign each node to the closest centroid

In the graph version of K-Means, the initial centroids are simply random nodes of the graph.

Step 2 is implemented with a Breadth-First Search (BFS): at the beginning, every centroid is inserted into a queue. Then, while the queue is not empty, we take the first node x from the queue, and each of its neighbours that has not yet been assigned to a cluster is assigned to the cluster of x. Step 3.2 is implemented in the same way.

To compute the new centroids mentioned in step 3.1, we first have to define what a centroid is: a centroid is a node whose maximum distance to any other node in its cluster is minimal. With this definition in mind, we use another BFS to compute the distance from each node to every other node in the same cluster. The node with the smallest such maximum distance becomes the new centroid of that cluster.
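To make the two steps concrete, here is a minimal Python sketch of the whole loop, assuming the graph is stored as an adjacency-list dictionary; all names (`graph_kmeans`, `assign_to_centroids`, ...) are ours for illustration and are not taken from the repository.

```python
from collections import deque
import random


def assign_to_centroids(adj, centroids):
    """Steps 2 and 3.2: multi-source BFS. Every node joins the cluster of
    the centroid whose BFS front reaches it first."""
    cluster = {c: c for c in centroids}      # node -> centroid of its cluster
    queue = deque(centroids)
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in cluster:             # not yet assigned to any cluster
                cluster[y] = cluster[x]
                queue.append(y)
    return cluster


def new_centroid(adj, members):
    """Step 3.1: the new centroid is the member whose maximum BFS distance
    to the other members of the cluster is minimal."""
    def eccentricity(source):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y in members and y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        return max(dist.values())
    return min(members, key=eccentricity)


def graph_kmeans(adj, k, no_generations, seed=None):
    rng = random.Random(seed)
    centroids = rng.sample(list(adj), k)                  # step 1: random nodes
    cluster = assign_to_centroids(adj, centroids)         # step 2
    for _ in range(no_generations):                       # step 3
        groups = {}
        for node, c in cluster.items():
            groups.setdefault(c, set()).add(node)
        centroids = [new_centroid(adj, m) for m in groups.values()]  # step 3.1
        cluster = assign_to_centroids(adj, centroids)                # step 3.2
    return cluster
```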

Graphical explanation (figures omitted here): first iteration and second iteration of Graph K-Means.

Evolutionary Algorithm

  1. Representation

The individuals of the population are represented using the locus-based adjacency representation. An individual consists of n genes g_1, ..., g_n, and each gene can take an allele value in the range {1, ..., n}. A value j assigned to the i-th gene is interpreted as a link between nodes i and j of the vertex set V of the network.
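For illustration, here is a small Python sketch of how such an individual could be decoded into communities (the connected components induced by the links (i, g_i)). The union-find helper and the neighbour-restricted initialisation are our assumptions, not necessarily the repository's exact code.

```python
import random


def random_individual(adj):
    """A locus-based individual: gene i holds one of node i's neighbours
    (the same restriction the mutation operator below uses), so every
    decoded individual is a valid partition of the graph."""
    return {i: random.choice(list(adj[i])) for i in adj}


def decode(individual):
    """Nodes i and individual[i] are linked; each connected component of
    these links is one community."""
    parent = {i: i for i in individual}          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    for i, j in individual.items():
        parent[find(i)] = find(j)                # merge the two endpoints

    communities = {}
    for i in individual:
        communities.setdefault(find(i), set()).add(i)
    return list(communities.values())
```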

  2. Algorithm

The algorithm takes as parameters the population size, the number of generations and the mutation probability, and works as follows:

  1. generate the initial population of n individuals
  2. evaluate the individuals
  3. for no_generations:
     3.1. select parents for the next generation
     3.2. generate offspring by applying crossover
     3.3. mutate the offspring
     3.4. evaluate the offspring
     3.5. select the next generation
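Put together, the loop could look like the Python sketch below; `random_individual` and `decode` are from the representation sketch above, while `select_parents`, `crossover`, `mutate`, `select_next_generation` and `modularity` are placeholders sketched in the following subsections, not the repository's actual functions.

```python
def evolutionary_algorithm(adj, pop_size, no_generations, mutation_prob, rng):
    # steps 1-2: initial population and its fitness (modularity, section III)
    population = [random_individual(adj) for _ in range(pop_size)]
    fitness = [modularity(adj, decode(ind)) for ind in population]

    for _ in range(no_generations):                              # step 3
        parents = select_parents(population, pop_size, rng)      # step 3.1
        offspring = [crossover(parents[i], parents[i + 1], rng)  # step 3.2
                     for i in range(0, pop_size - 1, 2)]
        offspring = [mutate(ind, adj, mutation_prob, rng)        # step 3.3
                     for ind in offspring]
        off_fitness = [modularity(adj, decode(ind))              # step 3.4
                       for ind in offspring]
        population, fitness = select_next_generation(            # step 3.5
            population + offspring, fitness + off_fitness, pop_size)

    best = max(range(pop_size), key=lambda i: fitness[i])
    return population[best], fitness[best]
```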

  3. Selection

The parents which create the next generation of offspring are chosen uniformly at random from the population. For the next generation, we keep the n best individuals among the old generation and the newly created offspring, where n is the initial population size.
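A possible Python sketch of both selection steps (names are ours):

```python
def select_parents(population, n_parents, rng):
    """Parents are drawn uniformly at random from the current population."""
    return [rng.choice(population) for _ in range(n_parents)]


def select_next_generation(candidates, fitness, n):
    """Elitist survivor selection: keep the n best individuals among the
    old generation and the newly created offspring."""
    order = sorted(range(len(candidates)),
                   key=lambda i: fitness[i], reverse=True)
    survivors = order[:n]
    return ([candidates[i] for i in survivors],
            [fitness[i] for i in survivors])
```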

  4. Crossover

The crossover operator used is standard uniform crossover: a binary mask of length equal to the number of nodes is randomly created, and an offspring is generated by taking from the first parent the genes where the mask is 0 and from the second parent the genes where the mask is 1.
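In Python, uniform crossover over a locus-based individual stored as a dict could look like this (a sketch, not the repository's code):

```python
def crossover(parent_a, parent_b, rng):
    """Uniform crossover: a random binary mask decides, gene by gene,
    whether the allele comes from parent_a (mask 0) or parent_b (mask 1)."""
    child = {}
    for gene in parent_a:
        mask_bit = rng.randint(0, 1)
        child[gene] = parent_a[gene] if mask_bit == 0 else parent_b[gene]
    return child
```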

  5. Mutation

The mutation operator is applied to each individual with a given probability, and each gene of the chromosome then has a chance to be mutated. The new allele value of a mutated gene is chosen at random among the neighbours of the node the gene represents.
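One possible reading of this operator in Python; the exact per-gene probability is not specified in the report, so reusing mutation_prob for it is our assumption:

```python
def mutate(individual, adj, mutation_prob, rng):
    """With probability mutation_prob the individual is mutated: each of its
    genes then has its own chance to be reassigned to a random neighbour."""
    if rng.random() >= mutation_prob:
        return individual                         # individual left untouched
    mutated = dict(individual)
    for gene in mutated:
        if rng.random() < mutation_prob:          # per-gene chance (assumption)
            mutated[gene] = rng.choice(list(adj[gene]))
    return mutated
```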

III. Fitness Function

We used modularity (Q) [6] as the fitness function. For a partition of the graph into k communities it can be written as:

Q = \sum_{i=1}^{k} \left[ \frac{L(C_i, C_i)}{m} - \left( \frac{L(C_i, V)}{2m} \right)^2 \right]

where L(C_i, C_i) is the number of edges inside community i, L(C_i, V) is the total number of edge endpoints attached to the nodes of community i (the sum of their degrees), and m is the total number of edges in the graph.

This function is used by the scientific community because it takes into account both the number of connections inside each community and penalises the edges connected to that community. This handles the degenerate case in which an algorithm reports the whole graph as a single community.
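For completeness, a small Python sketch of the fitness evaluation under the adjacency-list assumption used above; `communities` is the list of node sets produced by `decode`:

```python
def modularity(adj, communities):
    """Modularity Q of a partition: the fraction of edges inside each
    community minus the expected fraction under a random placement."""
    m = sum(len(neigh) for neigh in adj.values()) / 2    # total number of edges
    q = 0.0
    for community in communities:
        # L(C_i, C_i): edges with both endpoints inside the community
        internal = sum(1 for u in community for v in adj[u]
                       if v in community) / 2
        # L(C_i, V): sum of the degrees of the community's nodes
        degree_sum = sum(len(adj[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q
```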

IV. Our results

To test the performance of our algorithms we used two real-world datasets:

  1. Facebook data was collected from survey participants (4039 nodes, 88234 edges) [3]
  2. Zachary's Karate Club [4]

K-Means Clustering

This algorithm gives much better results on the bigger dataset than the EA implementation, and it is also much faster on both datasets. However, on the small dataset it performs worse than the EA.

According to [5], the best known modularity for Zachary's Karate Club dataset is about 0.41. The following results were obtained on this dataset:

| No_starts | No_clusters | BEST | AVG |
| --- | --- | --- | --- |
| 2500 | 2 | 0.320513 | 0.244005 |
| 2500 | 3 | 0.375372 | 0.264226 |
| 2500 | 10 | 0.291913 | 0.19077 |
| 2500 | 20 | 0.143573 | 0.0874909 |

For 2500 starts of the algorithm, the total time is 0.5 seconds.

These results are on the Facebook dataset:

| No_starts | No_clusters | BEST | AVG | RUN_TIME |
| --- | --- | --- | --- | --- |
| 20 | 2 | 0.320513 | 0.244005 | 120 s |
| 20 | 5 | 0.663135 | 0.498463 | 75 s |
| 20 | 10 | 0.729996 | 0.591778 | 45 s |
| 20 | 20 | 0.705235 | 0.671849 | 25 s |

For 20 starts of the algorithm, the total time taken by each configuration is shown in the table above.

Evolutionary Algorithm

The results refer to the same two instances, on which several tests were performed: one contains 34 nodes and 78 edges, and the other, provided by Facebook, contains 4039 nodes and 88234 edges.

For the instance with 34 nodes, the evolutionary algorithm was tested with 4 different combinations of parameter values, and the results are reported in the following table:

| No_population | No_generations | BEST | AVG |
| --- | --- | --- | --- |
| 500 | 50 | 0.401216 | 0.396363 |
| 500 | 100 | 0.401216 | 0.401081 |
| 1000 | 50 | 0.401874 | 0.400916 |
| 1000 | 100 | 0.401874 | 0.401874 |

According to [1], a modularity value above 0.3 indicates a good community division. Therefore, the evolutionary algorithm can be considered to perform well on the instance with 34 nodes. Its results are also close to the state-of-the-art results presented by K. Ozturk et al. in [5], which are around 0.419.

The results in the table indicate that increasing both the population size and the number of generations produces a better community division. However, going beyond 1000 individuals or 100 generations no longer improves the results.

The problem with the Evolutionary Algorithm appears on the instance with 4039 nodes. The best modularity values it reached were around 0.03-0.04, which indicates that the Evolutionary Algorithm does not perform well on a graph with a large number of vertices and edges. Moreover, the algorithm takes too long to run on such an instance, so it is difficult to perform many tests on it. We were able to run 10 tests with a population of 100 individuals and 30 generations; both the average and the best result round to about 0.04.

V. Conclusion

We observed that the two methods implemented in our project give different results on the two datasets. On the Facebook instance, the K-Means algorithm greatly outperforms the Evolutionary Algorithm in both running time and quality: 0.73 vs 0.04 modularity and 45 seconds vs 76 seconds (for 50 generations and a population of 100). However, on Zachary's Karate Club dataset the situation is different, with the EA coming out on top (though at a higher running time).

We believe that even if we had implemented other heuristics, such as a better method for selecting the parents and the individuals of the next generation or different variation operators, the modularity would still not have reached an acceptable value of about 0.3 on the Facebook dataset.

VI. References

[1] Clara Pizzuti, Evolutionary Computation for Community Detection in Networks: a Review, IEEE Transactions on Evolutionary Computation, 2017

[2] Andrew Ng, Introduction to Machine Learning, Coursera, https://www.coursera.org/learn/machine-learning/lecture/93VPG/k-means-algorithm

[3] Jure Leskovec and Andrej Krevl, SNAP Datasets: Stanford Large Network Dataset Collection, http://snap.stanford.edu/data

[4] W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977)

[5] K. Ozturk, An Evolutionary Approach for Detecting Communities in Social Networks

[6] M. E. J. Newman and M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69(2), 2004
