The one-hop neighborhood of Paul Erdös
Souvik Bhattacherjee (bsouvik@cs.umd.edu)

Introduction

Paul Erdös (26 March 1913 – 20 September 1996) was a prolific Hungarian mathematician of the 20th century who spent a significant portion of his life living out of a suitcase and writing papers with those colleagues willing to give him room and board. He published more papers than any other mathematician in history. The idea of the Erdös number was created by his fellow mathematicians as a humorous tribute to his enormous output as one of the most prolific modern writers of mathematical papers [1]. An Erdös number of 1 is assigned to anyone who has published at least one mathematical paper with Erdös. Similarly, a joint publication with someone who has an Erdös number of 1 yields an Erdös number of 2. Erdös himself has the number 0. The Erdös number has gained prominence in scientific circles as one of the metrics used to judge the mathematical prowess of a mathematician. In this project, we try to understand the collaboration network of authors having an Erdös number of 1 using NodeXL, a popular visualization tool for network analysis.

Dataset & Preprocessing

We obtain the datasets for this project from the Erdös Number Project [2]. Two datasets were used, which are described below:

1. Erdos0 - This dataset lists all authors who have written a joint paper with Paul Erdös (i.e., who have Erdös number 1). It is in alphabetical order and shows the date of first collaboration, as well as the number of papers that each person has written with Erdös. There are currently 511 names on this list [3].
2. Erdos1graph - This dataset contains the adjacency lists for the induced subgraph of the collaboration graph on all Erdös coauthors, as of 2007. In other words, its vertices are people with Erdös number 1, and two vertices are joined by an edge if they have published a joint paper (with or without other collaborators).
Paul Erdös himself and people with Erdös number 2 are not included. In addition, it contains, for each author in the list, the number of Erdös number 2 authors that the author has collaborated with [4].

We had to preprocess the Erdos0 list to insert a 1 wherever the entry for the number of publications between that author and Erdös was missing. For Erdos1graph, the adjacency lists had to be converted to an undirected graph with no duplicate edges (by taking the lower triangular part of the adjacency matrix). Also, the author names and related information in each row had to be parsed separately and joined with the author list from the Erdos0 dataset.

Results and Analysis

We analyze the edge list of the Erdös #1 coauthors. There are 511 vertices (excluding Erdös) and 3208 edges in total. In some analyses we also include Erdös, where we felt it would improve the visualization.

Headline 1: There are 2 types of authors that Erdös wrote papers with: those who also wrote papers among themselves and those who did not, at all.

Figure 1: Graph of Erdös #1 coauthors (grouped by connected components)

We group the Erdös #1 coauthors by connected components; the results are shown in Figure 1. There are 42 connected components, of which the largest has 466 authors, or 91.19% of the authors in this network. The remaining authors, as can be seen from Figure 1, are either isolated or form a component of size at most 2. The diameter of the largest connected component is 10. The large number of isolated authors prompts us to explore their properties further (presented later).

We analyze the graph further by grouping the vertices into clique motifs of size 4 or more. The cliques must all lie within the large connected component, as every other component has fewer than 4 vertices. We found that the largest clique is a 7-clique.
Apart from this, the component contains one 6-clique, four 5-cliques and 19 4-cliques.

Figure 2: Graph of Erdös #1 coauthors (grouped by clique motif of size 4 or more)

Headline 2:
a) Harary Frank* and Noga Alon coauthored actively with both Erdös #1 and Erdös #2 authors, whereas Peter Salamon coauthored only with Erdös #2 authors

We lay out the graph of Erdös #1 coauthors in Figure 3, with the X-axis representing the number of Erdös #1 coauthors and the Y-axis representing the number of Erdös #2 coauthors. The authors at the extreme right of the X-axis (circled) are also the ones positioned highest along the Y-axis. They are Harary Frank* (44 Erdös #1 coauthors and 271 Erdös #2 coauthors) and Noga Alon (51 Erdös #1 coauthors and 228 Erdös #2 coauthors). We also notice Peter Salamon (circled) at the extreme left of the X-axis, who stands out among the other authors in that region with 113 Erdös #2 coauthors.

b) Lee Albert Rubel* plays a central role in this network with a comparatively low number of Erdös #1 coauthors

In the same graph, we size the vertices by their degree and color them by betweenness centrality. A blue node (circled in red) near the middle of the X-axis catches our attention. To observe the node in detail, we use dynamic filtering to retain the 10 nodes with the highest betweenness centrality (Figure 4). This node is particularly interesting because it has a comparatively low degree but a significant betweenness centrality. On closer inspection, we found that this author has the lowest degree among the top-10 authors but ranks 4th in betweenness centrality. He coauthored with 3 of the most prolific Erdös #1 authors: Ernst Gabor Straus (rank 3), Carl Bernard Pomerance (rank 5) and Zoltan Furedi (rank 8), who in turn collaborated with the top coauthors in this graph.
Thus, even with a comparatively low degree of 12, this node plays a central role in the coauthor network, the next higher degree among the top-10 authors being 26.

Figure 3: Graph of Erdös #1 coauthors (X-axis: # of Erdös #1 coauthors, Y-axis: # of Erdös #2 coauthors)

Figure 4: Graph of Erdös #1 coauthors (Top-10 ordered by betweenness centrality)

Headline 3: Most of the Erdös #1 authors who did not collaborate with any other Erdös #1 author also collaborated less with Erdös #2 authors

Figure 5: Graph of Erdös and the Erdös #1 coauthors who did not coauthor any paper with another Erdös #1 author

We construct the graph of isolated authors in the Erdös #1 collaboration graph, keeping Erdös in this case so that the graph has edges (Figure 5). The vertices (representing authors) are labeled by their names, and the edges are labeled by the year in which the corresponding author first published a paper with Erdös. The edge width is determined by the total number of publications that the author has with Erdös, with 3 as the maximum edge weight in this graph. The size of a vertex represents the number of Erdös #2 coauthors that the author has collaborated with. The color of a vertex indicates whether the author is living (blue) or deceased (orange). We also order the authors (manually) by the year in which they first published a paper with Erdös, with the year increasing clockwise.

We observe from Figure 5 that most of the vertices are very small, indicating that these isolated authors also collaborated less with Erdös #2 authors; the notable exceptions are Peter Salamon, Marcus Solomon and Tarski Alfred*, who collaborated with 113, 33 and 26 Erdös #2 authors, respectively. This visualization also helps us identify the oldest collaborator of Erdös in this graph who is still living: Joseph Lehner.
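
The isolated authors discussed under Headline 3 are the single-vertex components of the Erdös #1 graph. The following Python sketch illustrates, on a small made-up graph, how adjacency lists can be collapsed into duplicate-free undirected edges and how connected components and isolated vertices can then be found. It is only an illustration: the analysis in this report was carried out in NodeXL, and the names undirected_edges, connected_components and toy_adjacency, as well as the toy data, are assumptions introduced here and are not part of the Erdos1graph dataset.

```python
from collections import deque


def undirected_edges(adjacency):
    """Collapse adjacency lists into a set of undirected edges.

    Each edge is stored once as (smaller, larger); self-loops are dropped.
    """
    edges = set()
    for u, neighbors in adjacency.items():
        for v in neighbors:
            if u != v:
                edges.add((min(u, v), max(u, v)))
    return edges


def connected_components(vertices, edges):
    """Return the connected components as a list of sets, largest first."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)

    components.sort(key=len, reverse=True)
    return components


# Hypothetical toy data (NOT taken from Erdos1graph): each author maps to
# the list of coauthors, so every edge appears twice before deduplication.
toy_adjacency = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
    "D": ["E"],
    "E": ["D"],
    "F": [],
}

edges = undirected_edges(toy_adjacency)
components = connected_components(list(toy_adjacency), edges)

# Isolated authors are the components that contain a single vertex.
isolated = sorted(next(iter(c)) for c in components if len(c) == 1)

# Share of all vertices that fall in the largest component, in percent.
largest_share = 100 * len(components[0]) / len(toy_adjacency)

# The same ratio with the counts quoted in this report (466 of 511 authors).
report_share = round(100 * 466 / 511, 2)
```

On the toy graph this yields three components, with "F" as the only isolated vertex. Storing each edge as an ordered pair in a set plays the same role as keeping only the lower triangular part of the adjacency matrix: every collaboration is counted once regardless of which author's list it came from.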
Headline 4: Birds of a feather flock together: the top Erdös #1 collaborators also collaborated highly among themselves

Figure 6: Graph of Erdös #1 coauthors having 30 or more collaborators (among Erdös #1 authors)

The collaboration graph of Erdös #1 coauthors having 30 or more collaborators is presented in Figure 6. We observe that this graph is densely connected, indicating that the authors in it also collaborated highly with each other. Figure 7 clusters these 14 authors using the Girvan-Newman clustering algorithm. We observe a 6-clique and a 4-clique, which further establishes the high connectivity of this network.

Figure 7: Graph of Erdös #1 coauthors having 30 or more collaborators (clustered using the Girvan-Newman clustering algorithm)

Headline 5: Erdös #1 authors having high collaborations with Erdös #2 authors did not collaborate highly among themselves

We construct the graph of Erdös #1 authors who have collaborated with 100 or more Erdös #2 authors (Figure 8). The size of each node represents the number of Erdös #2 coauthors that the author has. The nodes are colored by their degree in the full Erdös #1 collaboration graph (Figure 1). The observation here is that the graph in Figure 8 is not as densely connected as the one in Figure 6. This implies that these authors do not collaborate highly among themselves. In fact, Saharon Shelah has no collaborators within this graph, although he has collaborated with 15 other Erdös #1 authors. Peter Salamon did not collaborate with any of the Erdös #1 coauthors, so his isolation here is not a surprise.

Figure 8: Graph of Erdös #1 coauthors having 100 or more Erdös #2 collaborators

NodeXL Critiques

NodeXL is a great tool for handling graphs, especially because it is integrated with Microsoft Excel. I have had the chance to use Pajek (another network analysis tool) before but haven't found it to be as flexible as NodeXL.
The features of NodeXL that interest me the most are the Grouping options, especially the cluster and motif options. The Graph Metric and the Autofill options were equally useful. However, there are a few things that I feel need more attention (as I found out during the course of my NodeXL usage); they are listed below:

1. The user needs to handle isolated nodes manually in order to display them. If there are a lot of isolated nodes in the graph, this becomes problematic.
2. The legends occupy a large portion of the screen below the graph display (shown in Figure 6 and Figure 8), which is wasteful.
3. The dynamic filter does not change the attributes of the graph dynamically. Consider, for example, a large graph that is filtered on some vertex attribute (say, betweenness centrality) and in which the vertex size depends on the degree of the vertex. The vertices present in the filtered graph might now have low degrees, but their sizes still reflect their original degrees. In some cases the original degree might be what is required, but an option could be offered to the user so that the vertex properties change dynamically as well.
4. It would be useful if the edges in a graph could be laid out in some order in a Star layout (which is one of the most common layouts). Although the Fruchterman-Reingold layout does give the graph a Star shape, it does not offer an option for ordering the edges. This idea comes from the clock glyph designs studied earlier in this course. In this case, I had to lay out the edges manually (Figure 5).

References

1. Erdös Number. http://en.wikipedia.org/wiki/Erd%C5%91s_number
2. The Erdös Number Project. http://www.oakland.edu/enp/
3. Erdos0 dataset. https://files.oakland.edu/users/grossman/enp/Erdos0.html
4. Erdos1graph dataset. https://files.oakland.edu/users/grossman/enp/erdos1graph.html