Hi I’m Gilad

I love data, analysis and visualization. Chief data scientist at betaworks.

Mapping Twitter’s Python and Data Science Communities

Last month I gave a tutorial at the PyData NYC conference on my work using Python’s NetworkX library and the open source graphing tool Gephi. The tutorial covers some fundamental social network theory and then highlights a methodology that I commonly use to analyze communities on Twitter. Here I mapped out the embedded social network among all Twitter profiles that had the word ‘python’ in their user bios. I collected this data by identifying users who had been actively tweeting during the week before the conference. Then I generated a graph where each node represents a Twitter user and each edge represents a follower/following relationship. The larger a node, the more central it is within the community.
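A minimal sketch of that graph-building step, assuming the follower relationships have already been collected as (follower, followed) pairs. The sample names, the `build_follow_graph` and `export_for_gephi` helpers, and the choice of in-degree centrality are illustrative assumptions; the post does not say which centrality measure drives node size.

```python
import networkx as nx

# Hypothetical sample data: (follower, followed) pairs among matched profiles.
follows = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
    ("dave", "carol"),
]


def build_follow_graph(pairs):
    """Build a directed graph where an edge u -> v means u follows v."""
    graph = nx.DiGraph()
    graph.add_edges_from(pairs)
    return graph


def export_for_gephi(graph, path):
    """Write the graph as GraphML, a format Gephi can open."""
    nx.write_graphml(graph, path)


graph = build_follow_graph(follows)

# Assumption: in-degree centrality (share of other users following a node)
# stands in for "how central" a node is within the community.
centrality = nx.in_degree_centrality(graph)

# Store the score on each node so it can be mapped to node size in Gephi.
nx.set_node_attributes(graph, centrality, "centrality")

# Usage (writes a file, so it is left commented out here):
# export_for_gephi(graph, "python_community.graphml")
```

Once the GraphML file is open in Gephi, the `centrality` attribute can be used to rank node size.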

I then ran a number of stats on the graph. Modularity is an extremely interesting measure that helps us identify communities within the graph (i.e. regions that are much more densely interconnected than we would expect in a random graph). The results were utterly fascinating:

Each color represents a distinct modularity class. There are clearly embedded communities within the Twitter user segment that we observed, as shown by the different colors. If we dive into the profiles that represent each cluster, we can observe clear differentiation, mostly based on language but in some cases on context:

Of all the Twitter users who have ‘python’ in their bios, those in the dark blue modularity class were posting in English. The turquoise cluster, almost the same size as the English-based one, represents users who post in Japanese (and are most likely located in Japan). The two dominant clusters are quite separate except for a layer of connectivity in between. These are potentially “bridge programmers”: those who are connected to both the western Pythonistas and the Japanese ones. The Chinese accounts posting to Twitter are far less connected to the rest of the Python engineers. This might be because Twitter is not commonly used in mainland China, where the service is censored.

Now for the best bit of all, there’s a tiny purple cluster off to the left which is completely separated from the rest of the graph. When we dive into this section, surprise surprise, we see the true pythons. Yes, Twitter handles such as @247snakes and @WorldOfBail. Folks and feeds obsessed with snakes of all forms. Awesome!
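The modularity classes discussed above came from Gephi, whose modularity statistic uses the Louvain method. As a rough stand-in, here is a sketch using NetworkX's greedy modularity maximization, which is a different algorithm and may split a real graph differently. The toy graph (two cliques joined by one edge) is an assumption made purely for illustration, and it is undirected, unlike the follower graph.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: two 5-node cliques joined by a single "bridge" edge.
graph = nx.Graph()
graph.add_edges_from(nx.complete_graph(range(0, 5)).edges())
graph.add_edges_from(nx.complete_graph(range(5, 10)).edges())
graph.add_edge(4, 5)

# Greedy modularity maximization (not Louvain, which Gephi uses).
communities = community.greedy_modularity_communities(graph)

# Modularity score of the partition: higher means the groups are more
# densely connected internally than a random graph would predict.
score = community.modularity(graph, communities)

# Map each node to the index of its community, analogous to Gephi's
# "modularity class" column.
modularity_class = {
    node: index
    for index, members in enumerate(communities)
    for node in members
}
```

On a graph like the snake cluster described above, a fully disconnected group would always land in its own class, since no edges tie it to the rest.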

While Twitter as a social network is clearly not wholly representative of our society or professional community, the signal we can glean from its embedded networks and clusters is incredibly valuable, especially when looking at communities that are more likely to be on Twitter. The Python community mapping above helps us see the dominance of Python engineers across countries and cultures. Questions of bridging can get extremely interesting with networked data. For example, if you’re looking to hire someone who is involved with the Python scene in Japan but is also connected to the scene in the US, using modularity and clustering coefficients can help you identify potential candidates quite easily. Additionally, when seeking an “influential” person from within a community, looking at network properties can help identify important figures whose public metrics (such as number of followers) might not be so revealing.
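One way the bridging idea could be sketched: count, for every node, how many of its neighbours sit in a different modularity class, and report the clustering coefficient alongside. The `bridge_candidates` helper, the toy graph and the hand-assigned class labels are all hypothetical; this is not necessarily how the analysis above was done.

```python
import networkx as nx


def bridge_candidates(graph, modularity_class, min_outside=1):
    """Rank nodes by how many neighbours fall outside their own community.

    Returns (node, outside_neighbours, clustering_coefficient) tuples,
    most cross-community ties first.
    """
    rows = []
    for node in graph:
        outside = sum(
            1
            for neighbour in graph[node]
            if modularity_class[neighbour] != modularity_class[node]
        )
        if outside >= min_outside:
            rows.append((node, outside, nx.clustering(graph, node)))
    # More outside ties first; lower clustering breaks ties, since
    # bridges tend to connect people who do not know each other.
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


# Toy example: two triangles plus "x", who is tied to both groups.
graph = nx.Graph()
graph.add_edges_from([("a1", "a2"), ("a2", "a3"), ("a1", "a3")])
graph.add_edges_from([("b1", "b2"), ("b2", "b3"), ("b1", "b3")])
graph.add_edges_from([("x", "a1"), ("x", "a2"), ("x", "b1"), ("x", "b2")])

# Hand-assigned classes for illustration; in practice these would come
# from a community detection step.
modularity_class = {
    "a1": "A", "a2": "A", "a3": "A", "x": "A",
    "b1": "B", "b2": "B", "b3": "B",
}

candidates = bridge_candidates(graph, modularity_class)
```

Here `candidates[0]` is the node with the most ties into another community, which is the kind of profile the hiring example describes.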

Here’s an example of a similar mapping I ran for the data science community. I used a very similar methodology to the one described above, only taking users who have one of the following phrases in their Twitter bios: Data Science, Data Scientist, Machine Learning, Data Strateg*
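A hedged sketch of how bios might be matched against those phrases. The regular expression, the case-insensitive matching and the `bio_matches` helper are assumptions; the post does not say how the matching was actually implemented. The trailing `*` in "Data Strateg*" is read as a wildcard (strategy, strategist, strategies).

```python
import re

# Assumed translation of the four phrases into one case-insensitive pattern.
BIO_PATTERN = re.compile(
    r"\bdata\s+science\b"
    r"|\bdata\s+scientist\b"
    r"|\bmachine\s+learning\b"
    r"|\bdata\s+strateg\w*",
    re.IGNORECASE,
)


def bio_matches(bio):
    """Return True if a Twitter bio contains one of the target phrases."""
    # Profiles without a bio are treated as non-matching.
    return bool(BIO_PATTERN.search(bio or ""))


# Hypothetical bios for illustration.
sample_bios = [
    "Data Scientist at a startup. Coffee.",
    "I teach data strategy in Stockholm",
    "Python developer and snake enthusiast",
]
matched = [bio for bio in sample_bios if bio_matches(bio)]
```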

This resulted in a set of 1053 users who posted 14k tweets during the observed period of a week. Among those who posted the most were @data_nerd (659 Tweets!), @Chantel_Esworth (562) and @Da5_12 (253). Yet these three VERY NOISY profiles aren’t necessarily the most important or interesting part of the data science community. Here’s what the network looks like:

There’s a hairball-esque tight cluster that represents the majority of the identified community on Twitter, with a few offshoots (BTW – the tight cluster at the bottom right are data-strategy students in Sweden’s Hyper Island). If we dive into the main section, we can get a better understanding of the different clusters that make up the community (zoomable embedded graph below):

Each color represents a modularity class, effectively a region of the graph that is much more interconnected than the norm. The users within each modularity class tend to have one or more significant attributes in common. In the case of the Python mapping above, language was the clear differentiator. Here, that’s not the case. This gets very tricky.

With Hilary Mason’s immense help, we attempted to understand what each region of the graph means. Purple seems to be a mix of east coast and academics, while the dark blue is the west coast data drinking crew. Yellow looks like west coast social network folks while green have been doing it for a while. Although @BigDataBorat is identified within that segment… hmmm… The orange cluster is harder to nail down. Perhaps more academic, applied math and less tech-scene? @seanjtaylor seems to bridge between the two.

Remember, the clusters are based on embedded social interactions. The fact that more people are connected to each other in one portion of the graph is a significant signal. It just isn’t always easy to label it. Additionally, people move between jobs/cities all the time. The fact that someone may be highly connected to the west coast data science scene doesn’t necessarily mean that they are physically a part of it. Monica Rogati (@mrogati) is identified as more interconnected with the east coast group of dataists even though she’s out west, working at LinkedIn. This could be because she spent many years at CMU, or perhaps because she actively maintains connections to the data science community back east.

With these types of mappings, the community itself is often much better at understanding what the segments mean. Obviously, this doesn’t necessarily represent all the important people in the community, only those who are active on Twitter. There’s an inherent bias towards those who have been using Twitter for longer, as their networks tend to be more developed. Hoping to get a few friends to help me out with the classification here!

My IPython notebook code snippets can be found on GitHub, my slides here, and a video of the presentation embedded below:

19 comments to Mapping Twitter’s Python and Data Science Communities

  • Beautiful! Of course it’s easier for me to leave a “negative” (suggestion for improvement) feedback than fawn for sentences over the many positives. You tweeted to @hmason that you wanted to include more of the Pythonistas. One idea could be to use Python as a topic and then find words that are associated to Python but less associated to regular language. For example pandas (package names), pycon (conference names), or xrange (function names).

  • I love this! A quick observation about the bottom plot: I recognize several of the prominent usernames in the orange region as folks who are involved in or connected to the scikit-learn community. I wonder if scikit-learn has a big enough sphere of influence to be the root of this cluster?

    Great work – loved the talk at PyData as well.

  • giladlotan

    @isomorphismes – Ayup. That would be a good approach to capture more of each community. Could be done using topics/words and also could be done using social ties – effectively users who are heavily followed from within either the python or data science observed community, but don’t necessarily have any related signal in their bio field.

  • giladlotan

    @Jake – Thank you!
    That’s a really interesting observation. A potential next step could be to filter out everyone except that large Orange cluster and then identify clusters from within that group. This way, we might be able to identify the scikit-learn folks, but possibly also other communities focused around modules – pandas, nltk, etc…

  • DataJunkie here ;-). Awesome work. How many nodes and edges were in Gephi?

    The cluster breakdown is very interesting. It shows how different the west coast and east coast data science communities are in terms of interaction.

    Based on my interactions, I would label the orange cluster as “mathematical machine learning.” Those individuals are more interested in the theory than the rest of the graph in my opinion. They are also practitioners, but their tweets are more theoretical in nature rather than about tools etc.

  • giladlotan

    @Ryan – It’s a fairly small graph. There were a total of 1053 nodes and 8937 edges.
    Regarding the orange cluster, that makes a lot of sense. As @Jake suggested above, might be interesting to further dissect only those users in the orange cluster -> effectively run a bunch of stats graphs only on that cluster to identify possible clusters from within that group. That could potentially help understand its main attributes/characteristics.

  • [...] Mapping Twitter’s Data Science Communities: very cool look at social network analysis using Twitter data [...]

  • [...] Mapping Twitter Python and Data Science Communities [...]

  • Chris Kang

    This is really great. Thank you for the great lecture. Could you upload (to GitHub) the actual Python code you used to map out the Pythonistas? More precisely, the code used to scrape from Twitter. Being a beginner Python coder, I am having a hard time recreating your Pythonista graphml. (The Gephi part was very self-explanatory.) Thank you!

  • Amazing. Since I am a bridge in the graph – I am pretty sure the small sphere under me is French speaking pythonistas

  • Hey there, my best guess for the Orange cluster is also that it contains most of the international people in ML, like Mark Reid, Stwart, Mikio and Me, who are not necessarily very connected with sklearn.

  • Nice work with Gephi! And a goldmine of ideas and creativity! But your 4-keywords definition of “data-scientist” Twitter users seems unfortunately to be the exact opposite of a “data-science” approach (rich listing of all actors of this field, precise definition of the field itself, …). Could it be more than a fun exercise (with a better listing, i.e.)?

    • giladlotan

      Martin – totally.
      This was done as a fun experiment for a python tutorial.
      The next step would be to use these terms as “seeds” to identify part of the community. Then use the network signals to identify others who are core to the community, but do not have those words in their bios (think highly interconnected nodes who are outside of the existing set of nodes). I wanted to add that as a next step but just didn’t have the time… Perhaps over the next week when there’s a bit more time.

  • [...] Mapping Twitter’s Python and Data Science Communities – a self referential demonstration of Python’s Networkx library and the open source graphing tool, Gephi, by mapping Twitter users who refer to Python in their bio. Neat. [...]

  • [...] A map of data scientists on Twitter.  Unfortunately, since we don’t have “data scientist” in our Twitter description, Simply Statistics does not appear. I’m sure we would have been central…. [...]

  • [...] work still matters in the virtual world that was supposed to kill geography, was also endorsed by Gilad Lotan’s mapping of interactions among data scientists on Twitter, finding distinct East Coast and West Coast [...]

  • My name [data_nerd] “was” mentioned, not for my 15 years with data analysis, my skills in programming and data cleansing, the fact that I have worked with data from Fortune 100 companies nor my 2 years talking about Data Science and Analytics on Twitter to educate anyone that is interested but for being NOISY :o) Nice analysis!

  • [...] Anyone knows how to fix it? I’m really a newbie to python and Gephi. I blog I referred when creating my code is giladlotan.com/blog/mapping-tw… [...]

  • [...] Mapping Twitter’s Data Science Community (Very cool!) August 29, 2013 · by Ted O'Brien · in Data Visualization, News Articles, Simply Interesting, Social Media. · By Gilad Lotan (formally of SocialFlow) on his very cool blog. [...]
