Type an unfamiliar term into Google, and chances are your quest for answers will cross paths with Wikipedia. With more than 470 million unique monthly visitors as of February 2012, the world’s free encyclopedia has become a popular source of information. Our team (Jackie Cohen, Priya Kumar, and Florence Lee) used network principles to explore where Wikipedia gets its information.
Our analysis suggests that Wikipedia’s best articles cite similar sources. Why is this important? Information about the most frequently cited domains may give Wikipedia editors a good starting point to improve articles that need additional references.
We reviewed the citation network of Wikipedia’s English-language featured articles to discover which categories of articles shared similar citation sources, or domains. Wikipedia organizes its more than 4,200 featured articles into 44 categories; we found that every pair of categories shares at least one domain, creating a completely connected network.
In the network graph (Figure 1), each category is a node. If two categories share at least one domain, an edge appears between them. Since every category pair shares at least one domain, each node shares an edge with every other node. The graph has 44 nodes and 946 edges.
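The edge rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `domains_by_category` mapping, the function names, and the toy data are hypothetical stand-ins for the parsed citation data.

```python
from itertools import combinations

def build_edges(domains_by_category):
    """Return every category pair that shares at least one cited domain."""
    edges = set()
    for a, b in combinations(sorted(domains_by_category), 2):
        if domains_by_category[a] & domains_by_category[b]:
            edges.add((a, b))
    return edges

def complete_edge_count(n):
    """Number of edges in a completely connected graph with n nodes."""
    return n * (n - 1) // 2

# Hypothetical toy data: category name -> set of cited domains.
toy = {
    "biology": {"nytimes.com", "jstor.org"},
    "law": {"nytimes.com"},
    "music": {"jstor.org", "nytimes.com"},
}

edges = build_edges(toy)
print(len(edges) == complete_edge_count(len(toy)))  # True: the toy graph is fully connected
print(complete_edge_count(44))  # 946
```

With 44 categories, a completely connected graph has 44 × 43 / 2 = 946 edges, which matches the edge count reported above.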
Figure 1: Citation Network of English-Language Wikipedia Featured Articles
But the mere existence of an edge doesn’t tell us about the strength of the relationship, or the number of shared domains, between two categories. The two categories could share one domain or hundreds. We assigned weights to the edges to determine which pairs share more domains than others.
First, we determined how many shared domains existed in the entire network. If a domain appeared in articles of at least two categories, we considered it a shared domain. For example, at least one Wikipedia article in the biology category cited an nytimes.com link, and at least one Wikipedia article in the law category also cited an nytimes.com link. So we added nytimes.com to the list of shared domains. Overall, we found 1,103 shared domains in the network.
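One way to express the shared-domain rule is to count, for each domain, how many categories cite it. The sketch below assumes the same hypothetical category-to-domains mapping; apart from nytimes.com in biology and law, the domain names are made up for illustration.

```python
from collections import Counter

def find_shared_domains(domains_by_category):
    """Return the domains cited in at least two categories."""
    category_counts = Counter()
    for domains in domains_by_category.values():
        # set() ensures each category counts a domain only once
        category_counts.update(set(domains))
    return {domain for domain, n in category_counts.items() if n >= 2}

# Hypothetical toy data; only nytimes.com appears in both categories.
toy = {
    "biology": {"nytimes.com", "example-journal.org"},
    "law": {"nytimes.com", "example-court.gov"},
}
print(find_shared_domains(toy))  # {'nytimes.com'}
```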
We calculated edge weights by dividing the number of shared domains between a category pair by the total number of shared domains in the network. For example, biology and law shared 14 domains, so the pair’s edge weight was 0.0127 (14 divided by 1,103).
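The weight calculation can be sketched as follows. The function name and input mapping are assumptions made for illustration; the worked figure at the end is the biology and law example from the text.

```python
from itertools import combinations

def edge_weights(domains_by_category, total_shared):
    """Weight each category pair by its share of the network's shared domains."""
    weights = {}
    for a, b in combinations(sorted(domains_by_category), 2):
        n_common = len(domains_by_category[a] & domains_by_category[b])
        if n_common:
            weights[(a, b)] = n_common / total_shared
    return weights

# Worked example from the text: 14 shared domains out of 1,103.
print(round(14 / 1103, 4))  # 0.0127
```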
On a linear scale, the distribution of edge weights resembles a power law (Figure 2). But a true power law plots as a straight line on a log-log scale, and graphing the distribution that way (Figure 3) shows a curved line. So despite the long tail visible on the linear scale, the distribution doesn’t appear to follow a true power law.
Figure 2: Edge Weight Distribution – Linear Scale
Figure 3: Edge Weight Distribution – Log-Log Scale
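A rough numerical version of the log-log check is to fit a straight line to the logged values and see how well it fits. The sketch below is an assumption about how one might do this, not the method used in data_counts.py; the function name is hypothetical, and a least-squares fit on log-log axes is only a quick heuristic, not a rigorous test for a power law.

```python
import math

def loglog_fit(xs, ys):
    """Fit a least-squares line through (log10 x, log10 y).

    Returns (slope, intercept, r_squared). Points with non-positive
    values are skipped because their logarithm is undefined.
    """
    pts = [(math.log10(x), math.log10(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared

# A synthetic power law y = x^-2 is a straight line on log-log axes.
xs = [1, 2, 4, 8, 16]
slope, intercept, r_squared = loglog_fit(xs, [x ** -2 for x in xs])
print(round(slope, 6), round(r_squared, 6))
```

An R² noticeably below 1, or a visible bend in the plotted points, points to a curved line like the one in Figure 3.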
We colored the edges along an RGB spectrum according to their weights. The vast majority of category pairs share fewer than five percent of the network’s shared domains, which is why thick cables of blue traverse the network graph in Figure 1. The occasional turquoise edges represent the pairs that share more than five percent of shared domains.
The pairs that share the most domains are:
- Politics and Government Biographies & Religion, Mysticism, and Mythology (223 shared domains; 0.2022 edge weight)
- Physics and Astronomy & Physics and Astronomy Biographies (159 shared domains; 0.1442 edge weight)
- Physics and Astronomy & Religion, Mysticism, and Mythology (150 shared domains; 0.1360 edge weight)
The second pair feels intuitive. We scratched our heads at the first pair and found the third pair interesting, given that the two categories often appear on different sides of various public debates. Some of this pair’s shared domains, such as slate.com, jstor.org, and christianitytoday.com, were unsurprising, but we also noticed several unexpected ones, including brooklynvegan.com and vulture.com.
Figure 2 depicts an elbow around the edge weight of 4.6 percent. If we use this as a threshold to create the network, that is, only draw an edge if its weight is higher than 0.046, the network becomes far less connected (Figure 4).
Figure 4: Citation Network with an Edge Weight Threshold of 4.6 Percent
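Thresholding amounts to filtering the weighted edges. The sketch below is a minimal illustration with a hypothetical function name; the weights are the ones reported in the text.

```python
def apply_threshold(weights, cutoff=0.046):
    """Keep only the edges whose weight is strictly greater than the cutoff."""
    return {pair: w for pair, w in weights.items() if w > cutoff}

# Edge weights reported in the text.
weights = {
    ("Politics and Government Biographies", "Religion, Mysticism, and Mythology"): 0.2022,
    ("Physics and Astronomy", "Physics and Astronomy Biographies"): 0.1442,
    ("Biology", "Law"): 0.0127,
}

kept = apply_threshold(weights)
print(len(kept))  # 2: the biology and law edge falls below the cutoff
```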
We also examined the domains themselves. The three most popular shared domains were:
The widespread citation of these domains aligns with Wikipedia’s encyclopedic nature; these sites are gateways into vast swaths of digitally recorded information and knowledge.
At the other end of the spectrum, 601 domains were shared by only one category pair. Removing those domains from the graph deleted only four edges, since most category pairs share more than one domain. This suggests that edge weight is a better threshold for examining the relationships in this network than domain distribution.
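The removal experiment can be sketched as below. A domain shared by exactly one category pair is one cited in exactly two categories, so the sketch drops those and recounts the edges. The function names and the toy data are hypothetical, not taken from the project code.

```python
from collections import Counter
from itertools import combinations

def count_edges(domains_by_category):
    """Count the category pairs that share at least one domain."""
    return sum(
        1
        for a, b in combinations(sorted(domains_by_category), 2)
        if domains_by_category[a] & domains_by_category[b]
    )

def drop_single_pair_domains(domains_by_category):
    """Remove domains cited in exactly two categories (shared by one pair)."""
    counts = Counter()
    for domains in domains_by_category.values():
        counts.update(set(domains))
    single_pair = {d for d, n in counts.items() if n == 2}
    return {cat: set(doms) - single_pair for cat, doms in domains_by_category.items()}

# Hypothetical toy data: w.com is the only link between categories d and e.
toy = {
    "a": {"x.com", "y.com"},
    "b": {"x.com", "y.com", "z.com"},
    "c": {"x.com", "z.com"},
    "d": {"w.com"},
    "e": {"w.com"},
}

before = count_edges(toy)
after = count_edges(drop_single_pair_domains(toy))
print(before - after)  # 1: only the d-e edge disappears
```

In the toy data, three of the four domains are shared by a single pair, yet removing them deletes just one edge, because the other pairs still share x.com.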
While typical network characteristics such as centrality measures, community structures, or diffusion were not relevant in the completely connected network, examining edge weights yielded interesting findings. Future work could examine network characteristics of the thresholded graph as well as consider whether patterns exist in the way various category pairs cite different domain types (e.g., journalism, scholarly sources, personal blogs).
Project Code: Available here
The project can be replicated by running grab_data.py, manage_data.py, and data_counts.py. The first two files collect and parse the data we describe above. The data_counts.py file contains all the network manipulation and, if you download the entire repository, can be run immediately (the repository includes the results from the first two files). This last file contains comments that explain where in the code we determined different network metrics and examined aspects of our network. This includes where we implemented edge weight thresholding (Figure 1, Figure 4) and where we conducted Pythonic investigations into whether the edge weight distribution was a power law distribution.