
A Citation Network of Wikipedia’s Featured Articles

Type an unfamiliar term into Google, and chances are your quest for answers will cross paths with Wikipedia. With more than 470 million unique monthly visitors as of February 2012, the world’s free encyclopedia has become a popular source of information. Our team (Jackie Cohen, Priya Kumar, and Florence Lee) used network principles to explore where Wikipedia gets its information.

Our analysis suggests that Wikipedia’s best articles cite similar sources. Why is this important? Information about the most frequently cited domains may give Wikipedia editors a good starting point to improve articles that need additional references.

We reviewed the citation network of Wikipedia’s English-language featured articles to discover which categories of articles shared similar citation sources, or domains. Wikipedia organizes its more than 4,200 featured articles into 44 categories; we found that every pair of categories shares at least one domain, creating a completely connected network.

In the network graph (Figure 1), each category is a node. If two categories share at least one domain, an edge appears between them. Since every category pair shares at least one domain, each node shares an edge with every other node. The graph has 44 nodes and 946 edges.

Figure 1: Citation Network of English-Language Wikipedia Featured Articles


But the mere existence of an edge doesn’t tell us about the strength of the relationship, or the number of shared domains, between two categories. The two categories could share one domain or hundreds. We assigned weights to the edges to determine which pairs share more domains than others.

First, we determined how many shared domains existed in the entire network. If a domain appeared in articles of at least two categories, we considered it a shared domain. For example, at least one Wikipedia article in the biology category cited an nytimes.com link, and at least one Wikipedia article in the law category also cited an nytimes.com link. So we added nytimes.com to the list of shared domains. Overall, we found 1,103 shared domains in the network.
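A minimal sketch of this tally, assuming the parsed citations are available as a mapping from each category to the set of domains its articles cite (the names here, such as category_domains and shared_domains, are illustrative rather than taken from the project code):

```python
from itertools import combinations

# Illustrative structure: each category maps to the set of domains cited by
# its featured articles (the real data comes from the parsing scripts).
category_domains = {
    "Biology": {"nytimes.com", "nature.com", "jstor.org"},
    "Law": {"nytimes.com", "jstor.org", "supremecourt.gov"},
    "Physics and astronomy": {"arxiv.org", "nature.com"},
}

# A domain counts as "shared" if articles in at least two categories cite it.
shared_domains = set()
for cat_a, cat_b in combinations(category_domains, 2):
    shared_domains |= category_domains[cat_a] & category_domains[cat_b]

print(len(shared_domains))  # the full network yields 1,103 shared domains
```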

We calculated edge weights by dividing the number of shared domains between a category pair by the total number of shared domains in the network. For example, biology and law shared 14 domains, so the pair’s edge weight was 0.0127 (14 divided by 1,103).
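Expressed as code, the weighting might look like this (a sketch over the same illustrative data structure, not the project's actual implementation):

```python
from itertools import combinations

def compute_edge_weights(category_domains, total_shared):
    """Weight for each category pair: domains the pair shares divided by the
    total number of shared domains in the whole network."""
    weights = {}
    for cat_a, cat_b in combinations(category_domains, 2):
        pair_shared = category_domains[cat_a] & category_domains[cat_b]
        if pair_shared:
            weights[(cat_a, cat_b)] = len(pair_shared) / total_shared
    return weights

# Example from the post: biology and law share 14 of the 1,103 shared domains,
# so their edge weight is 14 / 1103, or roughly 0.0127.
print(round(14 / 1103, 4))  # 0.0127
```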

The distribution of edge weights appears to be a power law distribution (Figure 2). But graphing the distribution on a log-log scale (Figure 3) shows a curved line. Despite the linear distribution’s long tail, it doesn’t appear to be a true power law distribution.

Figure 2: Edge Weight Distribution – Linear Scale
Figure 3: Edge Weight Distribution – Log-Log Scale

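One way to reproduce a check like this is to plot the binned weights on both linear and log-log axes; on log-log axes a true power law falls roughly along a straight line. A matplotlib sketch (the Pareto draw below stands in for the real list of 946 edge weights):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; in the real analysis this would be the 946 edge weights.
weights = np.random.pareto(2.0, 946)

counts, bin_edges = np.histogram(weights, bins=30)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.plot(bin_centers, counts, marker="o")
ax_lin.set_title("Edge weights (linear scale)")

# A power law should look like a straight line here; a curved line,
# as in Figure 3, argues against a true power law.
ax_log.loglog(bin_centers, counts, marker="o")
ax_log.set_title("Edge weights (log-log scale)")

plt.show()
```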

We colored the edges along an RGB spectrum. The vast majority of category pairs share fewer than five percent of the network's shared domains, which is why thick cables of blue traverse the network graph in Figure 1. The occasional turquoise edges represent the pairs that share more than five percent of the shared domains.
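networkx and matplotlib can map each edge's weight to a color directly; a rough sketch of that kind of coloring (the three edges here are illustrative, and the winter colormap is just one way to get a blue-to-turquoise range):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Toy graph; the real network has 44 category nodes and 946 weighted edges.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Biology", "Law", 0.0127),
    ("Physics and astronomy", "Physics and astronomy biographies", 0.1442),
    ("Politics and government biographies", "Religion, mysticism and mythology", 0.2022),
])

edge_weights = [G[u][v]["weight"] for u, v in G.edges()]
pos = nx.spring_layout(G, seed=42)

# Low-weight edges render blue; higher-weight edges shift toward turquoise.
nx.draw_networkx_nodes(G, pos, node_size=300)
nx.draw_networkx_edges(G, pos, edge_color=edge_weights, edge_cmap=plt.cm.winter, width=2)
nx.draw_networkx_labels(G, pos, font_size=8)
plt.axis("off")
plt.show()
```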

The pairs that share the most domains are:

  1. Politics and Government Biographies & Religion, Mysticism, and Mythology (223 shared domains; 0.2022 edge weight)
  2. Physics and Astronomy & Physics and Astronomy Biographies (159 shared domains; 0.1442 edge weight)
  3. Physics and Astronomy & Religion, Mysticism, and Mythology (150 shared domains; 0.1360 edge weight)

The second pair feels intuitive. We scratched our heads at the first pair and found the third pair interesting, given that the two categories often appear on different sides of various public debates. Some of the shared domains between this pair, such as slate.com, jstor.org, and christianitytoday.com, were unsurprising, but we did notice several unexpected shared domains in this pair, including brooklynvegan.com and vulture.com.

Figure 2 depicts an elbow around the edge weight of 4.6 percent. If we use this as a threshold to create the network, that is, only draw an edge if its weight is higher than 0.046, the network becomes far less connected (Figure 4).

Figure 4: Citation Network with an Edge Weight Threshold of 4.6 Percent

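The thresholding step itself is easy to express; a sketch, assuming a weighted networkx graph like the one in the coloring example above:

```python
import networkx as nx

def threshold_graph(G, threshold=0.046):
    """Return a copy of G keeping only edges whose weight exceeds the threshold."""
    H = nx.Graph()
    H.add_nodes_from(G.nodes(data=True))
    H.add_edges_from(
        (u, v, d) for u, v, d in G.edges(data=True) if d["weight"] > threshold
    )
    return H
```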

We also examined the domains themselves. The three most popular shared domains were:

The widespread citation of these domains aligns with Wikipedia’s encyclopedic nature; these sites are gateways into vast swaths of digitally recorded information and knowledge.

At the other end of the spectrum, 601 domains were shared by only one category pair each. Removing those domains from the graph deleted only four edges, since most category pairs share more than one domain. This suggests that edge weight is a better threshold for examining the relationships in this network than domain distribution.
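A count like the one behind this observation could be computed as follows (again a sketch over the illustrative category_domains mapping, not the project's code):

```python
from collections import Counter
from itertools import combinations

def pairs_per_domain(category_domains):
    """Count, for each shared domain, how many category pairs share it."""
    counts = Counter()
    for cat_a, cat_b in combinations(category_domains, 2):
        for domain in category_domains[cat_a] & category_domains[cat_b]:
            counts[domain] += 1
    return counts

# Domains shared by exactly one category pair (601 of them in the post's data):
# single_pair = [d for d, n in pairs_per_domain(category_domains).items() if n == 1]
```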

While typical network characteristics such as centrality measures, community structures, or diffusion were not relevant in the completely connected network, examining edge weights yielded interesting findings. Future work could examine network characteristics of the thresholded graph as well as consider whether patterns exist in the way various category pairs cite different domain types (e.g., journalism, scholarly, personal blogs, etc).

Project Code: Available here

The project can be replicated by running grab_data.py, manage_data.py, and data_counts.py. The first two files collect and parse the data described above. The data_counts.py file contains all the network manipulation and, if you download the entire repository, can be run immediately (the repository includes the output of the first two files). This last file contains comments explaining where in the code we computed different network metrics and examined aspects of the network, including where we implemented edge weight thresholding (Figures 1 and 4) and where we investigated whether the edge weight distribution follows a power law.


Keep Your Sanity While Learning to Code

Learn to code? The question populated headlines this year. The Atlantic‘s Olga Khazan set journalists a-Twitter after pronouncing that journalism schools should not require students to “learn code.” She insisted her opposition extended to HTML and CSS, not data journalism, data analysis, or data visualization, making her post’s headline feel misleading given that those can require learning code.

Sean Mussenden of the American Journalism Review concisely expressed what I thought when reading Khazan’s piece. I fact-checked AJR articles in college, and tricking my brain into thinking I was fact-checking is the only thing that saved me from hurling a rock at my laptop while coding.

Four months ago I was a coding newbie. My crowning achievement was a Python script that determined whether a given string of text was of Tweet-able length. By December, I had cleaned and manipulated datasets in Python, created heat maps and scree plots in R, designed map visualizations in D3, and analyzed my Facebook and Twitter data. I needed the structure and graded homework assignments that graduate school courses in data manipulation, exploratory data analysis, and information visualization offered, but I wouldn’t have survived those classes without the wealth of resources on the Interwebz. The lessons I absorbed may help you meet your code-learning resolutions.

1. Find a tutorial that works for you

Free online tutorials abound. Shop around, take what works, and leave what doesn’t. I’m not suggesting giving up at the first sign of difficulty. Coding is hard, frustrating, tedious, and time-consuming. But it won’t always be. Rewards, even just the personal satisfaction of overcoming challenges, await those patient enough to try. Sink your time into a tutorial that fits your learning style and avoid wasting time on one that doesn’t. Last January I enrolled in a Coursera class on data analysis in R. The description said a programming background was helpful but not required. A week into the course, it was clear: a programming background was definitely required. I couldn’t afford to spend 10 hours on assignments I didn’t understand, so I stopped.

This September, I needed a crash course on Python. I had one week to complete a homework assignment that incorporated everything I learned in a year of basic coding courses. My lifesaver: Learn Python the Hard Way. Just like learning to write the alphabet by tracing over letters, this tutorial teaches the logic of coding by having you type code that’s in front of you. Another assignment required programming in D3, but I had no knowledge of JavaScript. Scott Murray’s D3 tutorials on Aligned Left and his O’Reilly book (which comes with sample files) were a life raft.

2. Google is your friend

Tutorials won’t give you all the information you need, but Google can help. Paste your error message into the search bar to get a sense of what went wrong. Or (and I found this more effective) type what you’re trying to accomplish. Even the craziest phrase (“after splitting elements in lines in python, keep elements together in for loop”) will get you somewhere. People often share snippets of code on forums like Stack Overflow. Test their code on your machine and see what happens. Debugging is a random walk, requiring you to chase links and try several strategies before that glorious moment when the code finally listens to you. Don’t worry. You’re learning even when you’re doing it wrong.

3. But people are your best friend

I tweeted my frustration with the Coursera class last January. To my surprise, digital storyteller Amanda Hickman responded to my tweets and set up a Tumblr to walk me through the basics of RStudio. People want to help, and their help will get you through the frustration of learning to code. This semester I saw the graduate student instructor nearly every week during office hours, bringing him the specific or conceptual questions that tutorials and Google couldn’t answer for me. When you get stuck, reach out. Ask that cousin who works in IT to help you debug something. Post on social media that you’re looking for help. Use Meetup to find fellow coders with whom you can meet face-to-face. Find groups like PyLadies (for Python) and go to their meetings. Don’t let impostor syndrome, the feeling that you’re not really a “coder,” stop you. You are a coder.

4. Take breaks

My first coding professor said, “Don’t spend hours on a coding problem. Take a break and return when your mind is fresh.” LISTEN TO HIM. More than once, I sank six or seven hours into debugging code, only to collapse into bed and then solve the problem within an hour the next morning. When coding threatens to consume your life (or unleash dormant violent tendencies), say, “Eff this for now” and take a well-deserved break.

Happy coding!

Diving into Data: How to Jumpstart Data Analysis in Media Organizations

Take dozens of smart people, give them a ton of information, and demand they make sense of it under deadline pressure. Journalists know this as life in a newsroom, and they see great work emerge from such an environment every day. The same concept works for service in the age of data, as I witnessed at last weekend’s A2 DataDive.

Students and local residents gathered on the University of Michigan’s campus for two days to crunch data, give back to local nonprofits, and learn a thing or two. Hackathon events such as these offer a model through which media organizations can expand their data journalism efforts and leverage local expertise.

One section of the Data Journalism Handbook describes how a Danish news organization used a hackathon to help journalists and web developers understand each other’s worldviews. Laura Rabaino offers a how-to guide on organizing a newsroom hackathon, based on her experience with one at the Seattle Times.

The DataDive is a hackathon with a public service twist. For months before the actual event, the organizers (of whom I am one) worked with four local nonprofit organizations to determine what data they had and what they hoped to do with it. At the event, each nonprofit gave a brief presentation that outlined its mission, its data, and its questions.

Then, we let people loose. Anyone who was interested could participate. Volunteers worked all day Saturday and during the morning on Sunday to analyze and visualize the data. On Sunday afternoon, each team presented their findings.

The most heartening and motivating element of the entire experience was seeing just how excited people get.
“I love this stuff,” volunteer Alex Janke remarked while in the middle of a statistical analysis. “I wish I could do this every weekend.”

The sensation of using your skills and hobbies to help another person out is powerful, and it’s why I think the DataDive model is well-suited for media organizations. Like nonprofits, many media outlets operate on a mission of public service. (If your organization is a nonprofit, you can be part of a DataDive in your area. Check out DataKind for more information.) A journalistic DataDive would give non-journos a peek behind the masthead and give the news organization a chance to engage with the community.

Of course, watch out for challenges:

  • Be aware of potential culture clash. (See: Obama campaign vs. open-source coders) At the DataDive, we make all of our materials open to the public.
  • Provide volunteers with enough guidance on what you hope to accomplish with the data, but don’t stifle the creativity that makes this event so valuable.
  • Make sure everyone writes down how they’re doing what they’re doing while they’re doing it. You want to be able to replicate (or at least understand) what happened.
  • Finally, don’t run out of coffee!

What questions do you have about the DataDive? Leave a note in the comments.

Policy Provides Context to Understand Data

How many rewards cards hang on your keychain? How many website accounts do you maintain? How much information do you share with organizations? Type your name into Spokeo and see what comes up. Chances are, it’s pretty accurate.

Many places collect personal information; that’s nothing new. But combine the ability to store unlimited amounts of data, aggregate and analyze massive datasets, and instantly release information into the public realm. You get the power to use customer behavior to determine when women are pregnant. You get maps that show addresses of people licensed to own pistols. You get the question of how aggressively to prosecute someone who downloads too many articles.

What are the implications of this? Thinking from a policy perspective can help journalists spur discussions around the role and use of data.

Take the case of Target’s data mining to pinpoint pregnant customers. Companies can link data they collect from customer interactions, data from public records, and data they purchase from third parties to build extremely detailed profiles of people. Do terms of use and privacy policies adequately convey this potential? These terms govern nearly every organization we interact with; is it possible to escape data collection? What policies, organizational or regulatory, can enable consumers to control their own data? Do consumers even care?

Journalists should also consider the policy implications of their own work. For example, the New York-based Journal News obtained gun license data, which was public, and mapped the addresses of those licensed to own pistols. This sparked an outcry among citizens and triggered debate in media circles as to whether “journalists have a free pass to do whatever they want with public-record data.” New York state then passed legislation that removed such information from public access. The incident reminds journalists to ask, “What do I hope to accomplish with this story?” at each step of the reporting process.

Government use of data is another area ripe for data storytelling. As Scott Shackford writes:

“The degradation of the Fourth and Fifth Amendments is an academic or theoretical matter for so many people and often lacks a strong human narrative to draw public outrage….Whereas, just about everybody’s on Facebook. Facebook’s privacy systems affect them directly every day, and they see it. So Americans are furious that Instagram might sell their photos, while shrugging at what the federal government might do with the exact same data.”

Data and policy are not independent. For this reason, policy coursework forms the third leg of my concentration in data storytelling (with data analysis and design being the first and second). Understanding what organizations do with data is as important as using data to present compelling stories.

What data policy issues would you like to see journalists explore? Describe them in the comments below.

You’re a Journalist. Why are you in iSchool?

Good question. I’m in information school (iSchool) because knowing how to interview people and write stories is not enough to succeed as a journalist today.

In earlier eras, mainstream media were the source of facts (read: information). Between the World Wide Web and mobile technology, facts now lie at our fingertips. We don’t need to wait for the morning paper or the nightly news to keep us updated. Facts have become commoditized, but journalists never traded solely in facts. A journalist’s unit of currency is the story, a set of facts that, taken together, help people make sense of the world.

Which brings me to data.

Data is everywhere. On its own, one cell from a spreadsheet is useless. But thousands, millions, billions, even trillions of data points taken together produce meaning. Data plus a human to analyze and contextualize it coalesces into knowledge, insight, and conclusions. How can we humans develop these skills? Enter iSchool. Take this list of the 10 things journalists should know coming into 2013. iSchool students build skills, interact with data, manage information, build online communities, design user experiences, build mobile applications, and learn very quickly that change is the norm. That’s more than half the list!

Data, as Ken Doctor writes, enables journalists to go deeper:

“Well-programmed technology can do a lot of journalistic heavy lifting. In part, all the technological innovation simply lets smart journalists ask better questions and get a faster result. It both allows journalists to get questions they know they’d like to answer — and goes a step beyond. Getting at unstructured data opens inquiry to lots of content previously beyond reach. Machine learning, says [Chase] Davis [director of the Center for Investigative Reporting], ‘allows datasets to tell you their stories. You don’t have to be limited by your own experience.’ ”

It also makes business sense to hire a data journalist, as Amy Gahran points out:

Journalists, editors and publishers who make an effort to become data literate may be able to demonstrate a competitive advantage to the communities they serve — and, indirectly, to funders, sponsors or advertisers.

As a student at UMSI, I am creating my own path of study called data storytelling. This includes computer programming and data analysis (to learn how to glean insight from data), graphic and interaction design (to present that insight in a compelling manner), and information policy (to put that insight into context). I also help organize the A2 Data Dive, a service event in which community members and students spend a weekend crunching data for nonprofit organizations.

Join me on this adventure to learn how to interview data and tell its stories. Do you have a thought, idea, or (constructive) criticism to offer? Leave a comment below, send me a tweet, or email me at priyaku [at] umich [dot] edu. Welcome!

(And yes, I treat “data” as singular. As linguist Geoff Nunberg writes:

“Whatever the sticklers say, data isn’t a plural noun like ‘pebbles.’ It’s a mass noun like ‘dust.’”)