Type an unfamiliar term into Google, and chances are your quest for answers will cross paths with Wikipedia. With more than 470 million unique monthly visitors as of February 2012, the world’s free encyclopedia has become a popular source of information. Our team (Jackie Cohen, Priya Kumar, and Florence Lee) used network principles to explore where Wikipedia gets its information.
Our analysis suggests that Wikipedia’s best articles cite similar sources. Why is this important? Information about the most frequently cited domains may give Wikipedia editors a good starting point to improve articles that need additional references.
We reviewed the citation network of Wikipedia’s English-language featured articles to discover which categories of articles shared similar citation sources, or domains. Wikipedia organizes its more than 4,200 featured articles into 44 categories; we found that every pair of categories shares at least one domain, creating a completely connected network.
In the network graph (Figure 1), each category is a node. If two categories share at least one domain, an edge appears between them. Since every category pair shares at least one domain, each node shares an edge with every other node. The graph has 44 nodes and 946 edges (44 × 43 / 2).
Figure 1: Citation Network of English-Language Wikipedia Featured Articles
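The graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in data_counts.py: the `category_domains` mapping holds made-up toy data, and `build_edges` and `complete_graph_edge_count` are hypothetical helper names.

```python
from itertools import combinations

# Hypothetical toy data: category -> set of domains cited by its articles.
category_domains = {
    "biology": {"nytimes.com", "jstor.org"},
    "law": {"nytimes.com", "bbc.co.uk"},
    "music": {"bbc.co.uk", "jstor.org"},
}

def build_edges(category_domains):
    """Return {(cat_a, cat_b): shared_domains} for pairs sharing >= 1 domain."""
    edges = {}
    for a, b in combinations(sorted(category_domains), 2):
        shared = category_domains[a] & category_domains[b]
        if shared:  # draw an edge only when the pair shares a domain
            edges[(a, b)] = shared
    return edges

def complete_graph_edge_count(n):
    """Number of edges in a completely connected undirected graph on n nodes."""
    return n * (n - 1) // 2

edges = build_edges(category_domains)
print(len(edges))                     # 3: every toy pair shares a domain
print(complete_graph_edge_count(44))  # 946, matching the count in the text
```

Because every pair in the toy data shares a domain, the toy graph is completely connected too, just like the 44-category network.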
But the mere existence of an edge doesn’t tell us about the strength of the relationship, or the number of shared domains, between two categories. The two categories could share one domain or hundreds. We assigned weights to the edges to determine which pairs share more domains than others.
First, we determined how many shared domains existed in the entire network. If a domain appeared in articles of at least two categories, we considered it a shared domain. For example, at least one Wikipedia article in the biology category cited an nytimes.com link, and at least one Wikipedia article in the law category also cited an nytimes.com link. So we added nytimes.com to the list of shared domains. Overall, we found 1,103 shared domains in the network.
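The rule for a shared domain (it appears in articles of at least two categories) could be expressed roughly as follows. The data and the `shared_domains` helper are hypothetical, used only to illustrate the rule.

```python
from collections import Counter

# Hypothetical toy data: category -> set of domains cited by its articles.
category_domains = {
    "biology": {"nytimes.com", "example-bio.org"},
    "law": {"nytimes.com", "example-law.org"},
    "music": {"example-music.org"},
}

def shared_domains(category_domains):
    """Return the domains cited in at least two categories."""
    counts = Counter()
    for domains in category_domains.values():
        # Each category holds a set, so a domain counts once per category.
        counts.update(domains)
    return {domain for domain, n in counts.items() if n >= 2}

print(shared_domains(category_domains))  # {'nytimes.com'}
```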
We calculated edge weights by dividing the number of shared domains between a category pair by the total number of shared domains in the network. For example, biology and law shared 14 domains, so the pair’s edge weight was 0.0127 (14 divided by 1,103).
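As a quick check of the arithmetic, the edge weight formula can be written as a one-line function. The shared-domain counts below come from this post; the function name `edge_weight` is just an illustrative choice.

```python
TOTAL_SHARED_DOMAINS = 1103  # total shared domains reported for the network

def edge_weight(pair_shared_count, total_shared=TOTAL_SHARED_DOMAINS):
    """Shared domains for one category pair divided by all shared domains."""
    return pair_shared_count / total_shared

# Shared-domain counts quoted in this post.
for count in (14, 223, 159, 150):
    print(count, f"{edge_weight(count):.4f}")
# 14 0.0127
# 223 0.2022
# 159 0.1442
# 150 0.1360
```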
On a linear scale, the distribution of edge weights resembles a power law distribution (Figure 2). But graphing the distribution on a log-log scale (Figure 3) shows a curved line rather than the straight line a power law would produce. Despite the long tail of the linear-scale plot, it doesn’t appear to be a true power law distribution.
Figure 2: Edge Weight Distribution – Linear Scale
Figure 3: Edge Weight Distribution – Log-Log Scale
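The reasoning behind the log-log check is that a power law, y = c·x^(-a), becomes a straight line with slope -a once both axes are logged. One rough way to see this without plotting is to compare the slopes between consecutive points. The sketch below uses synthetic data, not our edge weights, and `loglog_slopes` is a hypothetical helper rather than the method in data_counts.py.

```python
import math

def loglog_slopes(xs, ys):
    """Slopes between consecutive points after taking the log of both axes."""
    logs = [(math.log(x), math.log(y)) for x, y in zip(xs, ys)]
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(logs, logs[1:])
    ]

xs = [1, 2, 4, 8, 16]
power_law = [100 * x ** -2 for x in xs]               # straight on log-log axes
exponential = [100 * math.exp(-0.5 * x) for x in xs]  # curved on log-log axes

# Constant slopes (about -2 each) indicate a straight line, i.e. a power law.
print([round(s, 2) for s in loglog_slopes(xs, power_law)])
# Slopes that keep getting steeper indicate a curve, i.e. not a power law.
print([round(s, 2) for s in loglog_slopes(xs, exponential)])
```

A long tail on the linear scale shows up in both synthetic series, which is why the log-log view is the more telling one.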
We scaled the edges on an RGB spectrum. The vast majority of category pairs cite fewer than five percent of the shared domains, which is why thick cables of blue traverse the network graph in Figure 1. The occasional turquoise edges represent the pairs that cite more than five percent of shared domains.
The pairs that share the most domains are:
- Politics and Government Biographies & Religion, Mysticism, and Mythology (223 shared domains; 0.2022 edge weight)
- Physics and Astronomy & Physics and Astronomy Biographies (159 shared domains; 0.1442 edge weight)
- Physics and Astronomy & Religion, Mysticism, and Mythology (150 shared domains; 0.1360 edge weight)
The second pair feels intuitive. We scratched our heads at the first pair and found the third pair interesting, given that the two categories often appear on different sides of various public debates. Some of this pair’s shared domains, such as slate.com, jstor.org, and christianitytoday.com, were unsurprising, but we also noticed several unexpected ones, including brooklynvegan.com and vulture.com.
Figure 2 depicts an elbow around the edge weight of 4.6 percent. If we use this as a threshold to create the network, that is, only draw an edge if its weight is higher than 0.046, the network becomes far less connected (Figure 4).
Figure 4: Citation Network with an Edge Weight Threshold of 4.6 Percent
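Thresholding amounts to keeping only the edges whose weight exceeds the cutoff. A minimal sketch, using the four edge weights quoted in this post rather than the full network; `apply_threshold` is a hypothetical name, not a function from the repository.

```python
THRESHOLD = 0.046  # the elbow in Figure 2

# Edge weights quoted in this post (shared domains / 1,103).
edge_weights = {
    ("Biology", "Law"): 0.0127,
    ("Politics and Government Biographies",
     "Religion, Mysticism, and Mythology"): 0.2022,
    ("Physics and Astronomy", "Physics and Astronomy Biographies"): 0.1442,
    ("Physics and Astronomy", "Religion, Mysticism, and Mythology"): 0.1360,
}

def apply_threshold(edge_weights, threshold):
    """Keep only the edges whose weight is strictly above the threshold."""
    return {pair: w for pair, w in edge_weights.items() if w > threshold}

kept = apply_threshold(edge_weights, THRESHOLD)
nodes = {category for pair in kept for category in pair}
print(len(kept), len(nodes))  # 3 edges among 4 categories; Biology-Law drops out
```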
We also examined the domains themselves. The three most popular shared domains were:
The widespread citation of these domains aligns with Wikipedia’s encyclopedic nature; these sites are gateways into vast swaths of digitally recorded information and knowledge.
As for the least popular domains, 601 were shared by only one category pair. Removing those domains from the graph deleted only four edges, since most category pairs share more than one domain. This suggests that edge weight is a better threshold for examining the relationships in this network than domain distribution.
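The domain-removal experiment can be sketched like this: find the domains that appear in only one pair’s shared set, remove them, and see which edges are left with nothing. The pair data is a made-up toy example and `drop_single_pair_domains` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical toy data: category pair -> set of domains the pair shares.
pair_domains = {
    ("a", "b"): {"nytimes.com", "only-ab.example"},
    ("a", "c"): {"nytimes.com"},
    ("b", "c"): {"nytimes.com"},
    ("c", "d"): {"only-cd.example"},
}

def drop_single_pair_domains(pair_domains):
    """Remove domains shared by exactly one pair; drop edges left empty."""
    counts = Counter(d for domains in pair_domains.values() for d in domains)
    single = {d for d, n in counts.items() if n == 1}
    pruned = {pair: domains - single for pair, domains in pair_domains.items()}
    # An edge survives only if the pair still shares at least one domain.
    pruned = {pair: domains for pair, domains in pruned.items() if domains}
    return pruned, single

pruned, single = drop_single_pair_domains(pair_domains)
print(len(single))                      # 2 single-pair domains removed
print(len(pair_domains) - len(pruned))  # 1 edge deleted
```

In the toy data, as in our network, removing many single-pair domains deletes few edges, because a pair usually has other shared domains to fall back on.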
While typical network characteristics such as centrality measures, community structures, or diffusion were not relevant in the completely connected network, examining edge weights yielded interesting findings. Future work could examine network characteristics of the thresholded graph as well as consider whether patterns exist in the way various category pairs cite different domain types (e.g., journalism, scholarly sources, personal blogs).
Project Code: Available here
The project can be replicated by running three files: grab_data.py, manage_data.py, and data_counts.py. The first two files collect and parse the data we describe above. The data_counts.py file contains all the network manipulation and, if you download the entire repository, can be run immediately (the repository includes the results from the first two files). This last file contains comments that explain where in the code we determined different network metrics and examined aspects of our network. This includes where we implemented edge weight thresholding (Figure 1, Figure 4) and where we conducted Pythonic investigations into whether the edge weight distribution was a power law distribution.
Learn to code? The question populated headlines this year. The Atlantic’s Olga Khazan set journalists a-Twitter after pronouncing that journalism schools should not require students to “learn code.” She insisted her opposition extended to HTML and CSS, not data journalism, data analysis, or data visualization, making her post’s headline feel misleading given that those can require learning code.
Sean Mussenden of the American Journalism Review concisely expressed what I thought when reading Khazan’s piece. I fact-checked AJR articles in college, and tricking my brain to think I was fact-checking is the only thing that saved me from hurling a rock at my laptop while coding.
Four months ago I was a coding newbie. My crowning achievement was a Python script that determined whether a given string of text was of Tweet-able length. By December, I had cleaned and manipulated datasets in Python, created heat maps and scree plots in R, designed map visualizations in D3, and analyzed my Facebook and Twitter data. I needed the structure and graded homework assignments that graduate school courses in data manipulation, exploratory data analysis, and information visualization offered, but I wouldn’t have survived those classes without the wealth of resources on the Interwebz. These lessons I absorbed may help you meet your code-learning resolutions.
1. Find a tutorial that works for you
Free online tutorials abound. Shop around, take what works, and leave what doesn’t. I’m not suggesting giving up at the first sign of difficulty. Coding is hard, frustrating, tedious, and time-consuming. But it won’t always be. Rewards, even just the personal satisfaction of overcoming challenges, await those patient enough to try. Sink your time into a tutorial that fits your learning style and avoid wasting time on one that doesn’t. Last January I enrolled in a Coursera class on data analysis in R. The description said a programming background was helpful but not required. A week into the course, it was clear: a programming background was definitely required. I couldn’t afford to spend 10 hours on assignments I didn’t understand, so I stopped.
2. Google is your friend
Tutorials won’t give you all the information you need, but Google can help. Paste your error message into the search bar to get a sense of what went wrong. Or (and I found this more effective) type what you’re trying to accomplish. Even the craziest phrase (“after splitting elements in lines in python, keep elements together in for loop”) will get you somewhere. People often share snippets of code on forums like Stack Overflow. Test their code on your machine and see what happens. Debugging is a random walk, requiring you to chase links and try several strategies before that glorious moment when the code finally listens to you. Don’t worry. You’re learning even when you’re doing it wrong.
3. But people are your best friend
I tweeted my frustration with the Coursera class last January. To my surprise, digital storyteller Amanda Hickman responded to my tweets and set up a Tumblr to walk me through the basics of RStudio. People want to help, and their help will get you through the frustration of learning to code. This semester I saw the graduate student instructor nearly every week during office hours, bringing him the specific or conceptual questions that tutorials and Google couldn’t explain to me. When you get stuck, reach out. Ask that cousin who works in IT to help you debug something. Post on social media that you’re looking for help. Use Meetup to find fellow coders with whom you can meet face-to-face. Find groups like PyLadies (for Python) and go to their meetings. Don’t let impostor syndrome, or the feeling that you’re not really a “coder,” stop you. You are a coder.
4. Take breaks
My first coding professor said, “Don’t spend hours on a coding problem. Take a break and return when your mind is fresh.” LISTEN TO HIM. More than once, I sank six or seven hours into trying to debug code, only to collapse into bed and then solve the problem within an hour the next morning. When coding threatens to consume your life (or unleash dormant violent tendencies), say, “Eff this for now” and take a well-deserved break.