I had the pleasure to meet BBC’s Matt Danzico last week and chat about measuring the effectiveness of online activism. We (unsurprisingly) chatted about Twitter data, which gives us the ability to identify dense clusters of users who are actively participating in the observed event. In the case of the #equality hashtag, heavily used last week during the height of the marriage equality / Prop 8 Supreme Court hearing, a number of distinct clusters of users emerge. For example, the close-knit group of users illustrated by the light green cluster in the bottom left portion of this graph represents a community of users who are all somehow affiliated with Lady Gaga’s Born This Way foundation. Both the official No on 8 campaign and the Human Rights Campaign Twitter handles are central to the network of users posting to the hashtag and the campaign at large. Perez Hilton is also clearly a central figure within this community.
I can’t figure out how to embed the video, but feel free to see it here!
When I talked to John Borthwick a couple of years ago about joining Betaworks, we played with the idea of what a shared data layer across the portfolio companies might look like. Many of the data-related challenges that early stage startups face are quite similar – distributed data aggregation, indexing, stream processing, counting stuff, ranking and classifying content. Just like design is successfully architected as a shared resource across the Betaworks network, can we create a function that would be useful across the board, but for data?
Back in 2011 it didn’t seem to make as much sense. SocialFlow had just signed the firehose deal with Twitter and the company was rapidly taking off. I loved Frank’s vision, and the team he assembled. So I set to work fulltime on helping build out the company’s data underpinning and start running various data-driven research projects across its immense data stream. The past two years have been a truly humbling learning experience. I feel very lucky to have had the opportunity to be a part of the SocialFlow team, seeing it grow from the 6 of us hacking around a table, to the significant player it has become. SocialFlow is changing the way publishers and brands are making sense of their networked audiences, and doing so in an innovative and scientifically-driven manner.
Starting next week, I’m heading back to beta. I’ll be taking on the role of Chief Scientist at Betaworks, working on building out this data layer that we devised a couple of years ago. The time is ripe, with a number of very exciting, data-heavy initiatives at early stages. We have a few ideas on how this might work. Needless to say, I’m wholly excited by the challenge. I love the Betaworks community, its experimental nature and innovative grasp of media. I’ll be staying involved with SocialFlow, and will always be its biggest advocate.
Since this Pew report came out, researchers and journalists in my circles have been trying to untangle what it actually means. One of its interpretations is that Twitter is full of haters. Another reads that Twitter is a mainstream liberal but a conservative wonk (Srsly?). The notion that opinions raised on Twitter are biased since the population of active users on the network is not representative of the general public makes a lot of sense. In the report, Pew researchers monitored opinions on Twitter across a number of political events during 2012 using Crimson Hexagon’s sentiment analysis service. At the same time, they ran national polls for sentiment around the chosen events. While they conclude that ‘Twitter reaction to events (is) often at odds with overall public opinion’, it seems like what they actually prove is that Crimson Hexagon’s (CH) sentiment analysis method for Twitter doesn’t reflect public opinion… and therefore is meaningless for assessing public sentiment on Twitter.
These two conclusions are very different. The former suggests that Twitter cannot be used at all to assess opinions, while what I’m suggesting is that language-based statistical models from Tweets in aggregate will not provide meaningful results when evaluating general public sentiment around an event. If context around users is not taken into account, specifically their historical propensity to respond to a topic, as well as their positioning within the network, we lose the ability to gain interesting insight from data coming from social networked spaces. Let me explain.
***
I spent some time looking into CH’s documentation online, and while I feel like I have a much better handle on what they do, I’m still partly guessing, as much of the meat counts as “proprietary algorithms”. From my understanding, CH gets access to the Twitter firehose, then for every project, a series of keywords and phrases contained within a time period are chosen, and all tweets where the keywords appear are extracted as the observed dataset. All non-English tweets are then filtered out (not sure exactly how this is done, and what happens with mixed language Tweets and/or slang). The chosen period seems to vary, from several hours to multiple days. For the Pew studies over the past year, it seems like the chosen time periods vary for every observed event.
In the next step, sentiment-related “assertions” are identified across all tweets. This is most likely done using a pre-existing dictionary of words and phrases that are based on a manual classification (labeled dataset) of tweets from the past. Then a random sample of assertions from the observed event is manually labeled. This is used to train a classifier which then runs across the whole dataset of assertions, breaking them down into 4 bins: positive, negative, neutral (informational) and irrelevant.
I’ve seen this type of methodology work well for long-form text, but have yet to see interesting results come out of Twitter data. Some of my main concerns are outlined below:
1. We cannot expect to assign sentiment correctly 100% of the time. Even humans often disagree about the sentiment of text:
Twitter is spoken, non-homogeneous language that is constantly evolving. It is very hard to train a classifier to accurately represent a single model for “language” on Twitter.
Cultural differences (“sick” – positive for some, negative for others).
Sarcasm and innuendo (“old men and women” - Chomsky’s constructional homonymity) – this is a crucial problem, especially around political content.
How are hashtags dealt with? Many of them, especially around political events, are words never previously seen by a model.
How does CH know that a person is situated in the US? They filter out non-English content, but tweeting in English doesn’t necessarily mean the person is located in the US.
2. There’s no user context around the tweets:
What if a user tweets multiple times during an event? This should be taken into account, as it highly skews the results.
What about retweets? They are an interesting signal reflecting user opinion, but there are many reasons why someone might retweet a message, and repetitive retweets from a single person reflecting effectively a single opinion should not be counted multiple times.
Why do we think that the general sentiment of an event is simply the sum of sentiment in individual tweets? At the end of the day, the data is coming from people. If we don’t understand who these users are, especially how they’re interconnected and what a meaningful sample would be, the sum of all tweets is meaningless.
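To make the user-context concern concrete, here is a minimal sketch of how much a tally can move when each user counts once instead of each tweet. The data and the function names (`share_by_tweet`, `share_by_user`) are made up for illustration, and using a user's majority label as their single "vote" is just one possible assumption, not how any particular service works.

```python
from collections import Counter, defaultdict

# Hypothetical labeled tweets: (user, sentiment label).
# "alice" tweets and retweets the same opinion four times.
tweets = [
    ("alice", "negative"),
    ("alice", "negative"),
    ("alice", "negative"),
    ("alice", "negative"),
    ("bob", "positive"),
    ("carol", "positive"),
]

def share_by_tweet(tweets):
    """Naive tally: every tweet or retweet counts as one vote."""
    counts = Counter(label for _, label in tweets)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def share_by_user(tweets):
    """One vote per user: each user's most frequent label counts once."""
    per_user = defaultdict(Counter)
    for user, label in tweets:
        per_user[user][label] += 1
    votes = Counter(c.most_common(1)[0][0] for c in per_user.values())
    total = sum(votes.values())
    return {label: n / total for label, n in votes.items()}

print(share_by_tweet(tweets))
print(share_by_user(tweets))
```

In this toy data the per-tweet tally is two-thirds negative, while the per-user tally is two-thirds positive: same tweets, opposite headline.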
3. If a service claims a 97% accuracy rate, especially on a problem that’s not considered solvable, you should be highly suspicious. How do they define accuracy here? Based on how their models classify their trained dataset? We have to be very careful here.
***
The fact that this is all proprietary technology makes it incredibly difficult to critique. I’ve yet to see Twitter sentiment analysis results that are actually meaningful – highlighting interesting, important and timely insight. Topsy claims that with #Twindex they accurately predicted election results for almost all US states. I haven’t played around with CH or Topsy as they are quite pricey. But I have spent the past couple of years working on insight from Twitter data, and while it is straightforward to identify extremely positive and extremely negative posts, the majority of content tends to land somewhere in the middle. From my experience, using natural language-based models on Twitter data without any context around the users or the observed event will not bring sufficiently valuable insight.
Is it totally useless? I don’t think so. There’s value in understanding the amount of buzz around an event, especially if there’s enough data about what normal behavior looks like. Yet IMHO, some of the most interesting insight can come from taking a networked approach to analyzing user response to an event.
For example, on election day 2012, we saw over 100,000 users who self-reported their vote, literally tweeting out “I voted for …”. Obviously we know that the network is biased and that there are many more young liberals using Twitter (as noted by Pew’s latest study on the demographics of social media users). We can account for this bias, but additionally, we can start to identify communities of users: teenagers in Florida, moms in Ohio, media professionals across the US. Communities are inherent to the organizational scheme of interest-based social networks such as Twitter. By getting more context on how heavily each community is involved within an observed event, and sampling accordingly, we may be able to significantly improve the way we gauge the general public opinion through the lens of Twitter.
Does this approach better align with the formal polls? I have no idea.
But a network-based approach may help give us important context around event polling – a smarter way to sample user data coming from social networks. In any case, we need to continue to experiment in this space and be much more critical of what we’re told by companies who promise algorithmic accuracy.
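One way to picture community-aware sampling is to reweight each community's observed result by how large that community is in the population you care about, rather than by how loud it is on Twitter. The sketch below is only that: the community names echo the examples above, but every number, the `target_weights` table and both function names are hypothetical, and the weights would have to come from an outside source such as census or survey data.

```python
# Hypothetical per-community results: how many users we observed and
# what fraction of them self-reported voting for candidate A.
observed = {
    "teens_florida": {"users": 6000, "share_a": 0.70},
    "moms_ohio": {"users": 1000, "share_a": 0.45},
    "media_pros": {"users": 3000, "share_a": 0.60},
}

# Hypothetical share of each community in the target population.
# Assumed to sum to 1.
target_weights = {"teens_florida": 0.2, "moms_ohio": 0.6, "media_pros": 0.2}

def raw_estimate(observed):
    """Pool all observed users; over-active communities dominate."""
    total = sum(c["users"] for c in observed.values())
    return sum(c["users"] * c["share_a"] for c in observed.values()) / total

def reweighted_estimate(observed, target_weights):
    """Weight each community by its assumed share of the population."""
    return sum(target_weights[name] * c["share_a"] for name, c in observed.items())

print(round(raw_estimate(observed), 3))
print(round(reweighted_estimate(observed, target_weights), 3))
```

Whether a reweighted number tracks formal polls any better is exactly the open question; the sketch only shows that the two estimates can differ noticeably.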
***
Some related links:
A recent example of community cluster analysis looking at a specific event - the Harlem Shake.
Watch out for Marshall Kirkpatrick’s new startup - Little Bird - an interesting actor in this space.
Alex Johnson, Journalist at NBC, blogs about his usage of Crimson Hexagon’s sentiment analysis technology.
I had the honor to participate in Harvard Law School’s behavioral economics and social media conference, organized by Cass Sunstein last week. Scholars from across Harvard along with folks from Facebook, Twitter, Microsoft Research and SocialFlow discussed important trends around social media, theory and practice and its potential to help us assess behavioral change. As part of the ‘theory and practice’ session led by Yochai Benkler, I presented alongside Facebook’s Eytan Bakshy and Sharad Goel of MSR.
Nate Matias, research assistant at the MIT Media Lab’s Center for Civic Media put together a comprehensive writeup of the session. Sean Laurence of Boston Startup School put together full audio of the event here. Following is a crib of my presentation on the promise of realtime data from social networks.
=-=-=-=-=
I’ll start with a short story.
I just moved into a larger apartment in New York City and finally have enough space for a piano. So I did what many do, and started obsessively searching the web for used upright pianos. From Craigslist to Google to rental stores, the task is actually quite difficult given the variety in types, sizes and prices.
It didn’t take long before piano ads started following me around the internet. As I consumed the news, I saw ads for Yamaha pianos. When I went to YouTube, ads for Steinway. Even when reading my daily Mashable quota… more piano ads. They followed me around as I browsed the web, regardless of what I was doing, making me feel terrible for being that indecisive procrastinator who can’t seem to make up his mind.
My anger at the ads quickly turned into pity. Faust Harrison Pianos were clearly users of the latest in digital marketing strategy wonders, buying against user behavior stored in cookies within people’s browsers. I must’ve clicked on their website at some point in time, and since then, my browser has a cookie that signals my interest in acquiring a piano. True, the intent is there. But believe it or not, it is not the only thing I think about throughout the day. The last thing I’d want is to be reminded every minute of every day that I still have to make this decision.
As ads attempt to become more “relevant” either by matching to our browsing history or to friend association, they are doing more harm than good if they do not understand the user’s context, and more importantly what someone is willing to be attentive to. Intent used to be the biggest buzzword around search engine conversations. Back in the day, the thought was that if we could identify someone’s intent we could present them with relevant information. They got that right with my search. But where these ads completely failed was in understanding my context as well as my personal psychology around purchasing. It’s been over 10 years since Google innovated and changed the world of advertising. Is cookie-based ad targeting *really* the best we can do?
=-=-=-=-=
It felt good to see that I’m not alone here. Digg’s @tolar claims that he visited Urban Airship once and now can’t escape their ads:
i visit urban airship ONCE, and now i can’t escape their ads trying to get me to integrate passbook on @digg. WTF
When I search google for ‘ads stop foll…‘ I clearly see that other users experience the exact same thing.
Various ad-blocking services have sprung up; FixTracking.com has some information. Otherwise, many informed users make sure to clear their browser cookies on a daily basis. Is this really the type of ecosystem we want to support? Where the technically informed are able to block ads from chasing them around the web, while the majority deal with the consequences?
=-=-=-=-=
Enter Social Media.
So much has been written, discussed and examined about the shift that we’re seeing with the popularity of social networked spaces. The networked nature of these spaces means that our old ways of dealing with audiences have got to change. Power has to be renegotiated, and in many cases doesn’t come top-down, but rather from loosely connected points in the network.
In order for information to spread, people along the way must be attentive and choose to pass the tweet or status update onwards. As the threshold to publishing content nears zero, people’s attention has become a scarce commodity. One cannot demand or even expect someone’s attention at any given point in time. As James Gleick puts it in his seminal book, The Information, “When information is cheap, attention becomes expensive“. You don’t need to take Intro to Macro-economics to get this.
=-=-=-=-=
The following plot does a great job at expressing how attention shifts within social networked spaces. The green line represents the number of tweets over time that had the word ‘Superbowl’ in them while the blue, the word ‘power’. This is measured over time across all publicly posted Tweets between February 3rd and 4th. Note the clear switch that happens when the Superdome goes dark. Attention shifts from the game which abruptly stops, to focus on the fact that half of the stadium loses power.
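A plot like this boils down to bucketing tweets by time and counting keyword matches per bucket. Here is a minimal sketch of that counting step; the sample tweets, their timestamps and the `counts_per_minute` helper are all made up for illustration, and one-minute buckets are an arbitrary choice.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, text) pairs standing in for a tweet stream.
tweets = [
    (datetime(2013, 2, 3, 20, 36, 5), "Superbowl halftime was great"),
    (datetime(2013, 2, 3, 20, 36, 40), "watching the superbowl"),
    (datetime(2013, 2, 3, 20, 38, 10), "did the power just go out?"),
    (datetime(2013, 2, 3, 20, 38, 30), "no power at the Superbowl"),
]

def counts_per_minute(tweets, keyword):
    """Count tweets containing `keyword` (case-insensitive) per minute."""
    keyword = keyword.lower()
    counts = Counter()
    for ts, text in tweets:
        if keyword in text.lower():
            # Truncate the timestamp to the start of its minute.
            counts[ts.replace(second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))

superbowl = counts_per_minute(tweets, "Superbowl")
power = counts_per_minute(tweets, "power")
```

Plotting `superbowl` and `power` on the same time axis gives the two lines; the attention shift is the moment the second overtakes the first.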
What evolved online is the poster child example of how realtime information can be used to inform marketing campaigns. It took minutes for Oreo to come up with an innovative advertisement in response to the blackout, which got them a significant level of visibility (16k retweets and 6k favorites so far only on Twitter). Twitter reported that it took just 4 minutes for someone to buy promoted tweets against searches for the phrase “power outage”. Other brands quickly responded as well, catering to the millions of sports fans who were following the chain of events happening in the stadium. Having flexibility and changing the frame to what people were attentive to, the power outage, clearly paid off.
We see these kinds of attention shifts happening all the time, whether affecting a wider region of the network, or a localized audience.
=-=-=-=-=
Using information from social networks can help us understand the context switches happening amongst audiences, and generally within populations, in realtime and over time: what people are attentive to and how that changes. In a study, Suman Deb Roy, our summer intern from last year, defined and measured what he called audience volatility – the frequency of change in topics at the focus of an observed group of users. The higher the volatility of an audience, the less focused it is, as there’s a wide array of topics at play. The lower the volatility score, the more focused an audience.
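The exact formula behind the volatility score isn't spelled out here, so the following is only one plausible way to turn "frequency of change in topics" into a number: take the set of top topics in consecutive time windows and average how much each set differs from the previous one (Jaccard distance). The function names and sample topics are hypothetical.

```python
def jaccard_distance(a, b):
    """1 minus (shared topics / all topics) for two topic collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def volatility(topic_windows):
    """Mean topic turnover between consecutive windows, in [0, 1]."""
    pairs = list(zip(topic_windows, topic_windows[1:]))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical trending-topic lists for three consecutive windows.
focused = [["campaign", "oscars"], ["campaign", "oscars"], ["campaign", "nba"]]
scattered = [["a", "b"], ["c", "d"], ["e", "f"]]

print(volatility(focused))
print(volatility(scattered))
```

Under this assumed definition, an audience whose topics never change scores 0 and one whose topics turn over completely in every window scores 1.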
For example, when we measured the volatility in Twitter’s trending topics across different cities we could see clear peaks and troughs in volatility. Remember, the higher the graph, the more volatile the trends within that city. What’s fascinating about this plot is the lowest point marked with an arrow. This happens around the second week of March, 2012, and represents a point of heightened focus across all major cities in the United States.
The lowest point on this graph is the day that Invisible Children launched their #Kony2012 campaign. This is the point of lowest volatility / maximum focus, showing just how good that campaign was at capturing people’s attention in all major cities across the United States.
=-=-=-=-=
Why is all this important?
This is the first time that we can clearly identify spikes in user attention, what groups of people are focused on, in realtime, and over time. We don’t have to wait for market research and poll results, but rather we can plug into this information. Additionally, we have a way to quantify these shifts, seeing just how much effect real-world events have on groups of people online, how much focus they choose to devote to said event.
As we get better at understanding user interaction within social networks, we’ll get a more holistic view of what’s going on. While there is still benefit in planning campaigns and taking the time to think through their design, social networked spaces bring with them the hope for a more nuanced understanding of user behavior, intent as well as context.
Maybe soon advertisements will stop following us around the web, and pop up in the right context, at the right time.
Am I too optimistic? Maybe. But I still want to get that piano!
Last month I gave a tutorial at the PyData NYC conference on my work using Python’s NetworkX library and the open source graphing tool, Gephi. The tutorial covers some fundamental social network theory and then highlights a methodology which I commonly use to analyze communities on Twitter. Here I mapped out the embedded social network amongst all Twitter profiles who had the word ‘python’ in their Twitter user bios. I grabbed this data by identifying users who had been actively tweeting during the period of a week before the conference. Then I generated a graph where each node represents a Twitter user and the edges, follower/following relationships. The larger a node, the more central it is within the community.
I then ran a number of stats on the graph. Modularity is an extremely interesting measure that helps us identify communities within the graph (i.e. regions that are much more interconnected than a random graph would have been). The results were utterly fascinating:
Each color represents a distinct modularity class. There are clearly embedded communities within the Twitter user segment that we observed, as shown by the different colors. If we dive into the profiles that represent each cluster, we can observe clear differentiation mostly based on language, but some, context:
From all Twitter users who have ‘python’ in their bios, the ones that were identified as the dark blue modularity class were posting in English. The turquoise cluster, almost the same size as the English-based one, represents users who post in Japanese (and are most likely located in Japan). The two dominant clusters are quite separate except for a layer of connectivity in between. These are potentially “bridge programmers”, those who are connected to both the western Pythonistas as well as the Japanese. The Chinese accounts posting to Twitter are far less connected to the rest of the Python engineers. This might be because Twitter is not commonly used in mainland China, where the service is censored.
Now for the best bit of all, there’s a tiny purple cluster off to the left which is completely separated from the rest of the graph. When we dive into this section, surprise surprise, we see the true pythons. Yes, Twitter handles such as @247snakes and @WorldOfBail. Folks and feeds obsessed with snakes of all forms. Awesome!
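For readers who want to see what the modularity number actually measures, here is a small plain-Python sketch of the standard modularity score Q for a given split of an undirected, unweighted graph: the fraction of edges that fall inside communities minus the fraction you would expect if edges were placed at random with the same node degrees. Gephi and NetworkX ship their own community-detection routines; this is only the arithmetic on a made-up toy graph, it treats follower links as undirected for simplicity, and it assumes no duplicate edges or self-loops.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    intra = defaultdict(int)   # edges with both ends in the same community
    degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Observed intra-community edge share minus the random expectation.
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
good_split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_blob = {node: 0 for node in good_split}

print(modularity(edges, good_split))
print(modularity(edges, one_blob))
```

Splitting the toy graph into its two triangles scores well above zero, while lumping everything into one community scores zero, which is why a community-detection algorithm searching for high modularity would prefer the split.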
While Twitter as a social network is clearly not wholly representative of our society or professional community, the signal we can glean from its embedded networks and clusters is incredibly valuable, especially when looking at communities which are more likely to be on Twitter. The python community mapping above helps us see dominance of python engineers across countries and cultures. Questions of bridging can get extremely interesting with networked data. For example, if you’re looking to hire someone who is involved with the python scene in Japan but is also connected to the scene in the US, using modularity and clustering coefficients can help you identify potential candidates quite easily. Additionally, when seeking an “influential” person from within a community, looking at network properties can help identify important figures, where many times their public metrics (such as number of followers) might not be so revealing.
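As a rough illustration of the bridging idea, the sketch below flags users who have at least one connection in each of two communities. It is deliberately simpler than the modularity-plus-clustering-coefficient approach mentioned above: it assumes community labels are already known, and the user names, labels and `bridge_nodes` helper are all hypothetical.

```python
from collections import defaultdict

def bridge_nodes(edges, community_of, a, b):
    """Nodes with at least one neighbor in community `a` and one in `b`."""
    neighbor_communities = defaultdict(set)
    for u, v in edges:
        neighbor_communities[u].add(community_of[v])
        neighbor_communities[v].add(community_of[u])
    return sorted(
        node for node, seen in neighbor_communities.items()
        if a in seen and b in seen
    )

# Hypothetical follower relationships, treated as undirected.
edges = [("ann", "bob"), ("yuki", "kenji"), ("kenji", "ann")]
community_of = {"ann": "en", "bob": "en", "yuki": "jp", "kenji": "jp"}

print(bridge_nodes(edges, community_of, "en", "jp"))
```

On a real graph you would likely want to rank these candidates, for example by how many connections they hold on each side, rather than treat every single cross-community link as equally meaningful.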
Here’s an example of a similar mapping I ran for the data science community. I used a very similar methodology to the one described above, only taking users who have one of the following phrases in their Twitter bios: Data Science, Data Scientist, Machine Learning, Data Strateg*
This resulted in a set of 1053 users who posted 14k tweets during the observed period of a week. Amongst those who posted the most were @data_nerd (659 Tweets!), @Chantel_Esworth (562) and @Da5_12 (253). Yet these three VERY NOISY profiles aren’t necessarily the most important or interesting part of the data science community. Here’s what the network looks like:
There’s a hairball-esque tight cluster that represents the majority of the identified community on Twitter, with a few offshoots (BTW – the tight cluster at the bottom right are data-strategy students in Sweden’s Hyper Island). If we dive into the main section, we can get a better understanding of the different clusters that make up the community (zoomable embedded graph below):
Each color represents a modularity class, effectively regions of the graph that are much more interconnected than the norm. The users within each modularity class tend to have some significant attribute/s in common. In the case of the python mapping above, language was the clear differentiator. Here, that’s not the case. This gets very tricky.
With Hilary Mason‘s immense help, we attempted to understand what each region of the graph means. Purple seems to be a mix of east coast and academics, while the dark blue is the west coast data drinking crew. Yellow looks like west coast social network folks while green have been doing it for a while. Although @BigDataBorat is identified within that segment… hmmm… The orange cluster is harder to nail down. Perhaps more academic, applied math and less tech-scene? @seanjtaylor seems to bridge between the two.
Remember, the clusters are based off of embedded social interactions. The fact that more people connected to each other in one portion of the graph is a significant signal. It just isn’t always easy to label it. Additionally, people move between jobs/cities all the time. The fact that someone may be highly connected to the west coast data science scene doesn’t necessarily mean that they are physically a part of it. Monica Rogati (@mrogati) is identified as more interconnected with the east coast group of dataists even though she’s out west, working at LinkedIn. This could be due to the fact that she spent many years at CMU. Or perhaps actively maintains connections to the data science community back east.
With these types of mappings, many times the community itself is much better at understanding what the segments mean. Obviously, this doesn’t necessarily represent all the important people in the community, only those who are active on Twitter. There’s inherent bias towards those who have been using Twitter for longer, as their networks tend to be more developed. Hoping to get a few friends to help me out with the classification here!
My ipython notebook code snippets can be found on Github, my slides here, and a video of the presentation embedded below:
Had the honor to present some data from yesterday’s VP debate at Bloomberg TV’s “Money Moves” show. I showed some Twitter visualizations highlighting the different topics at play, as well as people’s perception of who won the debate (hint: there are a surprising number of people who think Ryan beat Biden…) More below:
Last week I had the honor to give the opening keynote at Dalhousie University’s symposium on measuring influence on social media. It’s not common to see folks from industry keynoting academic events, so I was shocked when Anatoliy Gruzd from Dal’s Social Media Lab asked if I’d be the opening keynote at the symposium. I think a lot about the topic of influence, and have done a lot of work untangling what can be measured through data. Below I’m attaching a rough crib of my presentation, as well as my slide deck:
————————
The promise of data brings us hope that we can finally quantify the effects of social influence, giving us the opportunity to place a better price tag on certain digital spaces or interactions, potentially making our ecosystem much more efficient. We can finally attempt to answer questions such as: how are people activated, and what causes folks to purchase a product or pass along a piece of information.
Marketers and media alike tend to generate hype around status affordances which are plastered all over social network sites. These are metrics such as – number of followers, mentions, comments, fans, and so on – used within social network spaces to highlight user status. It is easy to get swept away by these readily available metrics without necessarily knowing what they mean (if you haven’t seen this yet, check out Colbert’s Internet Numbo-Tron 3000 skit: when tracking tweets per minute means… absolutely nothing!).
Influence as an Exposed Metric
I like the following definition of influence in social spaces:
The ability to disproportionately affect the spread of information.
In my work I’m extremely interested in how information spreads. For this reason, I look for points of influence when users get others to be attentive to a piece of information or media. If you’re a consumer brand, interesting points of influence for you are cases where a friend gets another to purchase an item. There’s always a wanted outcome in the form of an action: information spread, purchasing an item, viewing a TV show, etc.
Yet influence as an exposed metric is problematic for many reasons. We don’t think of providing a simple quantifiable measure for love, hate or trust. Yet we expect to do so with Influence. Can you tell me how much of your thinking is *innately* yours? What percentage of your thoughts are a direct result from advertising campaigns? What made it into your head because of peers and what are your original thoughts? Some say that influence has more to do with what is unconscious, the ways in which our brain picks up bits of information and formulates them together into an opinion or preference.
On top of that, people aren’t necessarily rational in their approach to trust. I may trust someone and continue to be influenced by their recommendations despite past transgressions. Some may bring influence from outside the network – a celebrity, a public figure. How does the fact that they attain influence outside the observed network affect our measurement? I haven’t seen anyone able to quantify and match the effects of influence across networks. And what about context? I shouldn’t be deemed an influencer on “popcorn” just because my tweet from the theater was retweeted by others (*cough* Klout *cough*).
Social recommendations happen between peers, friends and family members all the time. This is not new. What’s different now is that these moments of influence may be visible to us through the lens of data.
Networked Influence
The key to understanding influence is to look at the system as a whole, and think about users and how they’re interconnected rather than trying to identify specific people, or “influencers”. Users serve as information brokers, choosing what to give their attention to. But what drives these choices? And more importantly, can they be predicted?
I’m interested in a broader notion of influence. Not strictly peer to peer, or lists of these so-called “influencers”, but rather the effect on a community. I think of influence in the context of a networked ecosystem. Can we identify network attributes that create a higher likelihood for our wanted outcome? Can we figure out points in time when the network comes together in ways that will most likely help a message spread? An obvious but effective attribute is time of day. If your audience is mostly located within a certain geographic region, it’s best to publish content during the day (in that timezone) or else the majority of your audience will be sleeping. That’s just the start.
Based on recent experiments, Duncan Watts and Peter Dodds claim that going viral has more to do with the receptivity of an audience than with the people doing the sharing, tagging and endorsing. They claim that the role of “influencers” has been overstated:
“highly influential people were more effective than the average person in triggering social epidemics. But their importance was far less than the “overall structure of the network”: what matters far more to an idea, candidate, or product going viral is that the networks of people are easily influenced and networking with others who are easily influenced.”
“Twitter mega-influencers did generate greater cascades, but not regularly. Their ”hits” were sporadic and inconsistent, while newer and less influential Twitter users had breakout retweets because of the subject, topic, or timing.”
Sinan Aral, an assistant professor at NYU’s Stern School of Business and an authority on social contagion, studies the ability to identify susceptible members as a way to predict influence. The network is chaotic, and can be sporadic and inconsistent in terms of what generates large information flows. By focusing on understanding a group of users, how they’re interconnected, when they’re active and what topics “activate” them (what they’re susceptible to), we can start seeing patterns emerge.
A Bit about SocialFlow
SocialFlow is a technology startup in New York City that optimizes publishing to Twitter and Facebook for media outlets and brands. Let’s say you’re The Economist, and you have hundreds of articles published to your website on a daily basis. How do you choose what to post to Twitter/Facebook and when to do that? It is clear that there are diminishing returns the more you publish to social channels, meaning, you see substantially fewer clicks per shared link, and more unsubscribes if you overload people’s feeds with your content. So you have to pick out a few articles and make sure to post them at certain times of the day.
This is exactly what we do. We take in a feed of content that could be published to Twitter and Facebook. And based on a whole slew of metrics, we decide which article to post, and when to post it. How exactly do we do this you may ask?
SocialFlow is a data powerhouse. We ingest around 2TB of data per day. We work very closely with Twitter and consume whats called the public firehose – receiving any publicly posted tweet into our systems in realtime. Then we have multiple systems that index, track and count various attributes of this data. For example, we care deeply about audiences, so we run a wide array of stats on audiences (e.g. followers of a given account).
The Data
At SocialFlow we use a number of metrics to try and predict which piece of content is most likely to yield the highest level of responses at any given point in time. We look at audience activity – how active is an audience at any given point in time, and who from the audience is active? We also look at historical activity – what has activated my audience in the past? What have folks retweeted in the past? And in general, what’s happening in the network – is it peaking out of the ordinary? Are there conversations that are taking off in unusual ways?
We constantly look at the impact of events on the network. By understanding what’s normal, we can better identify events that deviate from the norm. This gives us the ability to quantify the impact of an event, or its “influence” on the network. In my presentation, I present a number of examples: a major football game, the Aurora Colorado shooting and Whitney Houston’s death. In each case we see clear deviations from the norm, and identify a unique pattern: one represents a typical sports match, another a typical breaking news event. The Aurora shooting displays a very different curve, due to consequences illustrated by this blog post.
If we go back to our definition of influence, it is important for us to understand what the network normally looks like, so that we can identify deviations from the norm. In each case we can quantify the level of influence an event had on the network by comparing it to the norm. Next I highlight event classification. The better we get at classifying an event into one of multiple bins, the better we understand its attributes: how long a trend will persist, when it will most likely peak, how fast it will decline and how far (geographically) it will spread. We identify points in time where audiences are in “sync”, focused on a single topic, versus points in time where there’s much more volatility and many topics are at play.
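To make "deviation from the norm" a little more concrete, here is a minimal sketch in Python. It is not SocialFlow's actual method: it assumes a simple z-score against a baseline window, and the function names, the threshold of 3 and the tweets-per-minute numbers are all made up for illustration.

```python
from statistics import mean, stdev

def deviation_scores(baseline, observed):
    """Return z-scores of observed counts relative to a baseline sample."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variance")
    return [(x - mu) / sigma for x in observed]

def flag_events(baseline, observed, threshold=3.0):
    """Return indices of observations whose z-score exceeds the threshold."""
    scores = deviation_scores(baseline, observed)
    return [i for i, z in enumerate(scores) if z > threshold]

# Hypothetical tweets-per-minute counts for a "normal" period.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]
# Hypothetical counts for the window we want to inspect.
observed = [101, 99, 180, 240, 130]

# Indices of the minutes that deviate strongly from the baseline.
print(flag_events(baseline, observed))
```

The size of the z-score gives a rough way to compare the impact of different events, as long as each one is measured against a baseline that is representative of the same audience and time of day.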
Networked Audiences and Information Flows
Next we take a look at the shape of an audience. One question that I’m very interested in is whether a highly clustered network is more susceptible to the spread of information than a network that is less dense. In the case of Kony 2012 we identified pre-existing communities amongst the initial users who heavily shared the video. These different parts of the network “lit up” at the same time, getting the topic trending across different cities at the same time, generating a snowball effect. This wasn’t simply a viral video that was randomly placed online and spread like wildfire, but rather the effect of a highly organized group and a pre-existing network that was set on spreading the content.
Similarly we see different events “light up” the part of the network that’s relevant to the context of the event. Coupons and deals light up one part of the population, while the political debates light up another. Each group that’s lit up is susceptible within that context.
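For readers who want to put numbers on "clustered" and "dense", here is a small Python sketch of two standard graph measures: density and the average local clustering coefficient. The toy graph and the helper names are my own for illustration, and the graph is assumed to be undirected and stored as a dict mapping each user to a set of neighbours.

```python
from itertools import combinations

def density(adj):
    """Share of all possible undirected ties that actually exist."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return edges / (n * (n - 1) / 2)

def local_clustering(adj, node):
    """Share of a node's neighbour pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of the local clustering coefficient over all nodes."""
    return sum(local_clustering(adj, node) for node in adj) / len(adj)

# Hypothetical audience: a tight triangle (a, b, c) plus one loosely attached user d.
audience = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(density(audience), average_clustering(audience))
```

Comparing these two numbers across audiences is one simple way to start asking whether tightly knit groups really do spread content more readily.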
Next I illustrate two examples of information flows. In the first, showing how a hashtag spreads, it is clear that the node with the most followers (a.k.a. the “influencer”) is not the most important node in the flow, but rather the node bridging between the original content creator and this highly followed node. Without this bridge, the information would never have spread, hence the node with the most influence within this specific flow is not necessarily the most highly followed, but rather the best positioned in terms of network and interest.
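One rough way to express this in code is to ask: if a given user had not passed the message on, how many others would never have seen it? The Python sketch below does that for a toy cascade. The node names and the "reach lost" measure are assumptions made up for this illustration, not the analysis behind the graphs in the presentation.

```python
from collections import deque

def reach(flow, origin, removed=None):
    """Return the set of users the message reaches from origin, skipping `removed`."""
    if origin == removed:
        return set()
    seen = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for nxt in flow.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reach_lost(flow, origin):
    """For each user, count how many OTHER users lose the message if that user is removed."""
    full = reach(flow, origin)
    return {
        node: len(full) - len(reach(flow, origin, removed=node)) - 1
        for node in full
        if node != origin
    }

# Hypothetical cascade: who passed the hashtag on to whom.
flow = {
    "creator": ["friend", "bridge"],
    "bridge": ["influencer"],
    "influencer": ["f1", "f2", "f3"],
}

lost = reach_lost(flow, "creator")
print(max(lost, key=lost.get), lost)
```

In this toy example the highly followed account accounts for three downstream users, but the bridge accounts for four, because the highly followed account itself only hears about the hashtag through the bridge.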
The second example is the case of @KeithUrbahn in the breaking news about the Osama Bin-Laden raid. Two users played a very important role in this information cascade, by re-contextualizing the information coming from Keith Urbahn and lending it their trust. Both @JakeSherman and @BrianStelter saw Keith’s tweet, and wrote that he is a trusted source due to his close connection to Donald Rumsfeld. This is information that requires a little more digging, but when used at the right time it helps the network gain trust, and thus the information spreads at an incredibly rapid rate.
Finally
Instead of focusing on lists of “influencers”, think about the network of users that you’re trying to understand. How does it usually behave, and when does it deviate from the norm? Think about audience receptivity – what topics light up your fans or followers? In aggregate there are definite patterns here. Think about the network attributes of your audience – its shape, how clustered users are, and who your most central users are.
Think about bridges and connectors, those that can help take a piece of information from an interesting source to users with an audience. And always keep in mind the outcome that you’re trying to attain. Whether it is clicks, web traffic or product purchases, influence should always be mapped to a desired action from a chosen population.
Earlier this week I was invited to participate in Bloomberg TV’s Market Makers to talk about data from last week’s presidential debate. The segment was shot the morning after the debate. Even with such short notice, we managed to show a few interesting views of the data:
1. Even though Romney is said to have won the debate, when you look at social data, #Obama2012 appears much more prominent and central. This might be happening because there are more users on Twitter rooting for Obama. Or perhaps this reflects a much more organized campaign, using a single hashtag for all of their communications.
2. We can clearly identify two different topic spaces amongst the Republicans – one is Romney’s campaign, and the other, Tea Party / #tcot. The conversation around Romney is much more fragmented than the conversation around Obama.
3. We identified three dominant clusters of users from Ohio discussing the debates. There was a clear political cluster, a media cluster, and surprisingly, a dominant cluster of users from Ohio State University.
Video of the interview, along with the graphs, is embedded below:
Some more information about the graphs:
First, I highlighted a simple graph showing the different curves that represent each of the prominent debate hashtags. Obviously #debates was substantially larger compared to #Obama2012, #Romney2012 and even #BigBird. That said, the fact that the other hashtags didn’t spike as much, doesn’t mean they were not dominant within the discussion online.
Next I presented a network graph that maps out prominent hashtags and user mentions during the first presidential debate. It is clustered by modularity, which means that hashtags/user mentions that appeared together in higher than usual levels, will be under the same color.
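The raw material for a graph like this is a count of how often each pair of hashtags or user mentions shows up in the same tweet. Here is a minimal Python sketch of that counting step. The regular expression, the function name and the sample tweets are assumptions for illustration; the modularity clustering itself happens afterwards in a graph tool and is not shown.

```python
import re
from collections import Counter
from itertools import combinations

# Matches hashtags and user mentions such as #debates or @PBS.
TOKEN = re.compile(r"[#@]\w+")

def cooccurrence(tweets):
    """Count how often each pair of hashtags/mentions appears in the same tweet."""
    edges = Counter()
    for text in tweets:
        # Lowercase and de-duplicate, then sort so each pair has one canonical order.
        tokens = sorted({t.lower() for t in TOKEN.findall(text)})
        edges.update(combinations(tokens, 2))
    return edges

# Hypothetical tweets, made up for this example.
tweets = [
    "Save #BigBird! @PBS #debates",
    "Watching the #debates, still thinking about #bigbird",
    "#Obama2012 on jobs #debates",
]

for (a, b), weight in cooccurrence(tweets).most_common():
    print(a, b, weight)
```

Each pair becomes a weighted edge, and pairs that co-occur more often than usual are the ones a modularity algorithm will tend to group under the same color.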
Here’s a zoomed in version:
And here’s #bigbird / #pbs:
The next graph maps out the friend/follow relationships between a segment of users who were discussing the #debates on Twitter. In this case, we see users from Ohio, or those affiliated with Ohio, and how they’re interconnected. Again, the graph is clustered by modularity, where three distinct clusters emerge.
The first (yellow, top right) seems to be politicos from Ohio, including @JohnKasich (governor), @johnboehner (Ohio congressional rep and speaker of the house) and @robportman (Ohio senator). The second (purple, middle right) consists of Twitter handles that represent local media in Cleveland and across Ohio, such as @clevelanddotcom and @WEWS. The third dominant cluster (green, bottom right) is made up of users from Ohio State University, who formed a significant part of the Ohioans discussing the debate.
Last week I gave a keynote at the Personal Democracy Forum (#PDF12) in New York City - Networked Power: what we learn from data. PDF is an incredible gathering of some of the smartest folks working on understanding the idea of Personal Democracy, where every citizen is a full participant. In my presentation, I focused on the networked characteristic of our media ecosystem and the new form of power that networks attain. I described ways in which we at SocialFlow analyze networked audiences, mapping out attributes such as activity levels, topical interest, engagement and their evolving friendship-based shape. Some key points from the presentation:
My Network != Your Network: networks of followers, audiences differ substantially in time of day, activity, engagement and shape.
Overarching generalizations will most likely be misleading.
We need to better understand networked dynamics such as information flows, audience intersections and the effect of algorithmic curation.
Networked spaces are NOT a pure meritocracy: certain positions are advantageous.
A couple of weeks ago we were pleased to spend some time with the folks from UNICEF, analyzing and discussing their #SahelNow campaign. The campaign is focused on drawing attention towards the food crisis unfolding in the Sahel region in West and Central Africa.
The campaign has seen a substantial rise in references, including participation from a number of major celebrities. We helped UNICEF analyze and understand hashtag usage across Twitter by looking at a few different aspects of the data:
Time Series Data: by mapping out levels of hashtag references we could identify prominent points in time where the conversation was spiking out of the ordinary
Phrase Co-occurrence: we generated a network graph view of all the related concepts referenced in Tweets along with the hashtag (concepts = phrases, other hashtags, users)
Friendship Graph: we extracted the underlying network of relationships amongst those users who referenced the #SahelNow hashtag, in effect identifying dense clusters of users who were actively promoting the cause in their region.
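As a rough illustration of the friendship graph step, the Python sketch below keeps only the follow relationships between users who posted the hashtag, then groups them into connected components. Connected components are a much cruder stand-in for the dense clusters we actually looked at, and the user names and function names are made up for this example.

```python
def induced_follow_graph(follows, hashtag_users):
    """Keep only follow edges where both users referenced the hashtag."""
    users = set(hashtag_users)
    adj = {user: set() for user in users}
    for follower, followed in follows:
        if follower in users and followed in users and follower != followed:
            # Treat a follow as an undirected tie for grouping purposes.
            adj[follower].add(followed)
            adj[followed].add(follower)
    return adj

def components(adj):
    """Return groups of connected users, largest group first."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            group.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(group)
    return sorted(groups, key=len, reverse=True)

# Hypothetical (follower, followed) pairs and hashtag users.
follows = [("ana", "ben"), ("ben", "cam"), ("dee", "eli"), ("ana", "outsider")]
hashtag_users = ["ana", "ben", "cam", "dee", "eli", "fay"]

for group in components(induced_follow_graph(follows, hashtag_users)):
    print(sorted(group))
```

The resulting adjacency structure is the kind of thing you would export to a tool such as Gephi for layout and proper community detection.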
Here’s a video highlighting some of the data manipulation we ran using Gephi: