Last month I gave a tutorial at the PyData NYC conference on my work using Python’s NetworkX library and the open source graphing tool Gephi. The tutorial covers some fundamental social network theory and then highlights a methodology I commonly use to analyze communities on Twitter. Here I mapped out the embedded social network among all Twitter profiles that had the word ‘python’ in their Twitter user bios. I grabbed this data by identifying users who had been actively tweeting during the week before the conference. Then I generated a graph where each node represents a Twitter user and each edge a follower/following relationship. The larger a node, the more central it is within the community.
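For anyone who wants to try the graph-building step, here is a minimal sketch in NetworkX. The sample pairs, the function names and the size range are made up for illustration, and using in-degree centrality to size the nodes is an assumption; the mapping above may well have used a different centrality measure.

```python
import networkx as nx

# Hypothetical sample data: (follower, followed) pairs.
follows = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
    ("dave", "carol"),
]


def build_follow_graph(pairs):
    """Build a directed graph with one edge per follower -> followed pair."""
    graph = nx.DiGraph()
    graph.add_edges_from(pairs)
    return graph


def node_sizes(graph, min_size=10.0, max_size=100.0):
    """Scale node sizes linearly by in-degree centrality (an assumed measure)."""
    centrality = nx.in_degree_centrality(graph)
    top = max(centrality.values(), default=0.0)
    sizes = {}
    for node, score in centrality.items():
        scale = score / top if top > 0 else 0.0
        sizes[node] = min_size + scale * (max_size - min_size)
    return sizes


graph = build_follow_graph(follows)
sizes = node_sizes(graph)

# Store the size on each node so it travels with the graph.
nx.set_node_attributes(graph, sizes, "size")

# Optionally export for Gephi (the file name is a placeholder):
# nx.write_gexf(graph, "python_twitter.gexf")
```

In this toy data the most-followed account ends up with the largest node, and an account nobody follows gets the minimum size.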
I then ran a number of stats on the graph. Modularity is an extremely interesting measure that helps us identify communities within the graph (i.e. regions that are much more interconnected than we would expect in a random graph). The results were utterly fascinating:
Each color represents a distinct modularity class. There are clearly embedded communities within the Twitter user segment that we observed, as shown by the different colors. If we dive into the profiles that represent each cluster, we can observe clear differentiation, mostly based on language but in some cases on context:
Of all the Twitter users who have ‘python’ in their bios, the ones identified as the dark blue modularity class were posting in English. The turquoise cluster, almost the same size as the English-based one, represents users who post in Japanese (and are most likely located in Japan). The two dominant clusters are quite separate except for a layer of connectivity in between. These are potentially “bridge programmers”, those who are connected to both the western Pythonistas and the Japanese ones. The Chinese accounts posting to Twitter are far less connected to the rest of the Python engineers. This might be because Twitter is not commonly used in mainland China, where the service is censored.
Now for the best bit of all: there’s a tiny purple cluster off to the left which is completely separated from the rest of the graph. When we dive into this section, surprise surprise, we see the true pythons. Yes, Twitter handles such as @247snakes and @WorldOfBail. Folks and feeds obsessed with snakes of all forms. Awesome!
While Twitter as a social network is clearly not wholly representative of our society or professional community, the signal we can glean from its embedded networks and clusters is incredibly valuable, especially when looking at communities which are more likely to be on Twitter. The Python community mapping above helps us see the dominance of Python engineers across countries and cultures. Questions of bridging can get extremely interesting with networked data. For example, if you’re looking to hire someone who is involved with the Python scene in Japan but is also connected to the scene in the US, using modularity and clustering coefficients can help you identify potential candidates quite easily. Additionally, when seeking an “influential” person from within a community, looking at network properties can help identify important figures whose public metrics (such as number of followers) might not be so revealing.
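To make the bridging idea concrete, here is a rough sketch of one way to shortlist bridge candidates once every user has a community label. It only counts each user's neighbors in the two communities of interest, which is a simplification of the modularity and clustering-coefficient approach mentioned above. The graph, the handles, the labels and the function name are all hypothetical.

```python
import networkx as nx


def bridge_candidates(graph, membership, community_a, community_b):
    """Return (node, neighbors_in_a, neighbors_in_b) for every node that is
    connected to both communities, strongest two-sided connection first."""
    candidates = []
    for node in graph.nodes:
        labels = [membership.get(nbr) for nbr in graph.neighbors(node)]
        in_a = labels.count(community_a)
        in_b = labels.count(community_b)
        if in_a > 0 and in_b > 0:
            candidates.append((node, in_a, in_b))
    # Rank by the weaker side, so a node needs real ties to both groups.
    candidates.sort(key=lambda row: min(row[1], row[2]), reverse=True)
    return candidates


# Hypothetical toy network: a US triangle, a Japanese triangle,
# and one account ("kenji") tied into both.
graph = nx.Graph()
graph.add_edges_from([
    ("u1", "u2"), ("u1", "u3"), ("u2", "u3"),
    ("j1", "j2"), ("j2", "j3"), ("j1", "j3"),
    ("kenji", "j1"), ("kenji", "j2"),
    ("kenji", "u1"), ("kenji", "u2"),
])
membership = {
    "u1": "us", "u2": "us", "u3": "us",
    "j1": "jp", "j2": "jp", "j3": "jp", "kenji": "jp",
}

candidates = bridge_candidates(graph, membership, "us", "jp")
```

On real data the membership labels would come from the modularity classes, and you would probably also want to weigh in clustering coefficients or betweenness before calling anyone a bridge.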
Here’s an example of a similar mapping I ran for the data science community. I used a very similar methodology to the one described above, only taking users who have one of the following phrases in their Twitter bios: Data Science, Data Scientist, Machine Learning, Data Strateg*
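As an illustration, a bio filter along these lines could be sketched with a regular expression. Treating the trailing `*` as "any word ending" and matching case-insensitively are both assumptions, and the sample bios and function names are invented.

```python
import re

# The phrases listed above; a trailing "*" is read as a wildcard suffix.
PHRASES = ["Data Science", "Data Scientist", "Machine Learning", "Data Strateg*"]


def compile_bio_pattern(phrases):
    """Turn the phrase list into a single case-insensitive regex."""
    parts = []
    for phrase in phrases:
        if phrase.endswith("*"):
            # "Data Strateg*" matches strategy, strategist, strategies, ...
            parts.append(re.escape(phrase[:-1]) + r"\w*")
        else:
            parts.append(re.escape(phrase))
    return re.compile(r"\b(?:" + "|".join(parts) + ")", re.IGNORECASE)


def matching_users(bios, pattern):
    """Return the handles whose bio matches the pattern."""
    return [handle for handle, bio in bios.items() if pattern.search(bio)]


# Hypothetical bios for illustration only.
bios = {
    "user_a": "Data scientist at a startup",
    "user_b": "Machine learning researcher, coffee drinker",
    "user_c": "Data strategist and runner",
    "user_d": "Snake breeder and python lover",
}

pattern = compile_bio_pattern(PHRASES)
selected = matching_users(bios, pattern)
```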
This resulted in a set of 1053 users who posted 14k tweets during the observed period of a week. Amongst those who posted the most were @data_nerd (659 Tweets!), @Chantel_Esworth (562) and @Da5_12 (253). Yet these three VERY NOISY profiles aren’t necessarily the most important or interesting part of the data science community. Here’s what the network looks like:
There’s a hairball-esque tight cluster that represents the majority of the identified community on Twitter, with a few offshoots (BTW, the tight cluster at the bottom right is made up of data-strategy students at Sweden’s Hyper Island). If we dive into the main section, we can get a better understanding of the different clusters that make up the community (zoomable embedded graph below):
Each color represents a modularity class, effectively a region of the graph that is much more interconnected than the norm. The users within each modularity class tend to have one or more significant attributes in common. In the case of the Python mapping above, language was the clear differentiator. Here, that’s not the case. This gets very tricky.
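For readers who would rather compute modularity classes in code than in Gephi, here is a minimal sketch using NetworkX. It swaps in greedy modularity maximization (Clauset-Newman-Moore), which is not the same algorithm as Gephi's Louvain-based modularity statistic, and it treats the graph as undirected. The toy graph and the function names are assumptions for illustration.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity


def detect_communities(graph):
    """Partition the graph by greedy modularity maximization and
    return the communities together with the partition's modularity score."""
    communities = [set(c) for c in greedy_modularity_communities(graph)]
    score = modularity(graph, communities)
    return communities, score


def modularity_classes(communities):
    """Map every node to the index of its community (its 'modularity class')."""
    return {node: idx for idx, members in enumerate(communities) for node in members}


# Hypothetical toy graph: two tightly knit groups joined by a single tie.
group_a = [f"en{i}" for i in range(5)]
group_b = [f"jp{i}" for i in range(5)]
graph = nx.Graph()
graph.add_edges_from(combinations(group_a, 2))
graph.add_edges_from(combinations(group_b, 2))
graph.add_edge("en0", "jp0")

communities, score = detect_communities(graph)
classes = modularity_classes(communities)
```

The algorithm only tells you which nodes belong together. Working out what each class actually means is still manual work.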
With Hilary Mason’s immense help, we attempted to understand what each region of the graph means. Purple seems to be a mix of east coast and academics, while the dark blue is the west coast data drinking crew. Yellow looks like west coast social network folks while green have been doing it for a while. Although @BigDataBorat is identified within that segment… hmmm… The orange cluster is harder to nail down. Perhaps more academic, applied math and less tech-scene? @seanjtaylor seems to bridge between the two.
Remember, the clusters are based on embedded social interactions. The fact that more people are connected to each other in one portion of the graph is a significant signal. It just isn’t always easy to label it. Additionally, people move between jobs and cities all the time. The fact that someone may be highly connected to the west coast data science scene doesn’t necessarily mean that they are physically a part of it. Monica Rogati (@mrogati) is identified as more interconnected with the east coast group of dataists even though she’s out west, working at LinkedIn. This could be because she spent many years at CMU, or perhaps because she actively maintains connections to the data science community back east.
With these types of mappings, the community itself is often much better at understanding what the segments mean. Obviously, this doesn’t necessarily represent all the important people in the community, only those who are active on Twitter. There’s an inherent bias towards those who have been using Twitter for longer, as their networks tend to be more developed. Hoping to get a few friends to help me out with the classification here!