Social Network Analysis: Calculating Degree with Gephi

This post continues my earlier one on creating a simple social network analysis of a friends network using Gephi. In this post we will focus on the term "Degree" in social network analysis. In the previous example I used an edge list in CSV format, imported into Gephi, to obtain this very simple social network.


The degree of a node is the number of edges incident on it. According to Wikipedia, the degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes.

In the resulting social network diagram you can see that Siva has the highest degree, which can be read off from Gephi as shown in the screenshot given below.

This shows the following attributes of Siva:

In-Degree: 4 (Vijay, Gopikrishna, Aditya, Kumar) – head endpoints at the node

Out-Degree: 7 (Ilango, Ramesh, Kannan, Aditya, Kumar, Vijay and GopiKrishna) – tail endpoints at the node

Degree: 11

According to Wikipedia, for a node, the number of head endpoints adjacent to it is called the indegree of the node, and the number of tail endpoints adjacent to it is its outdegree.
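This bookkeeping can be sketched in Python using a directed edge list. The list below is reconstructed, for illustration only, from the neighbours named in this post; the full network has more edges than Siva's.

```python
from collections import Counter

# Directed edge list (source, target), reconstructed from the post's example.
# Only Siva's edges are shown here; the real network has additional edges.
edges = [
    ("Vijay", "Siva"), ("Gopikrishna", "Siva"),
    ("Aditya", "Siva"), ("Kumar", "Siva"),        # head endpoints at Siva (in-degree)
    ("Siva", "Ilango"), ("Siva", "Ramesh"),
    ("Siva", "Kannan"), ("Siva", "Aditya"),
    ("Siva", "Kumar"), ("Siva", "Vijay"),
    ("Siva", "Gopikrishna"),                      # tail endpoints at Siva (out-degree)
]

# In-degree counts how often a node appears as a target,
# out-degree how often it appears as a source
in_degree = Counter(target for _, target in edges)
out_degree = Counter(source for source, _ in edges)

print(in_degree["Siva"])                           # 4
print(out_degree["Siva"])                          # 7
print(in_degree["Siva"] + out_degree["Siva"])      # 11
```

This reproduces the three numbers Gephi reports for Siva above.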

Degree Distribution:

The degree distribution shows, for each degree value, the count of nodes that have that degree. In this example there are 3 nodes with a degree of 4 and one node with a degree of 1.
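As a small Python sketch (the node names are hypothetical, only the degree counts come from the example above), the degree distribution is simply a count over the degree values:

```python
from collections import Counter

# Degree per node: hypothetical names, matching the example's counts
# (three nodes of degree 4, one node of degree 1)
degrees = {"A": 4, "B": 4, "C": 4, "D": 1}

# Degree distribution: degree value -> number of nodes with that degree
distribution = Counter(degrees.values())
print(distribution[4])   # 3 nodes have degree 4
print(distribution[1])   # 1 node has degree 1
```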


Step by Step Social Network Analysis using Gephi: Getting Started

In continuation of my previous blog post on Social Network Analysis using Gephi, I'm writing this post to explain how to create a very simple social network analysis using Gephi. You can also look at a very good introduction to Gephi written by Martin Grandjean here.

Goal and Scenario:

We have a friends network, and we want to depict visually how the friends are interconnected with each other. The goal is to understand how to use Gephi step by step, along with gaining a fundamental understanding of how the data is represented.


* You will need the Gephi software, which you can download from here.

* Data to be imported

* A fundamental understanding of what a graph, node and edge are. (Please read more here. To understand it more visually, please refer to this link.)

Step by step Instructions:

Step 1: After you install Gephi, you will see a screen like this. Click New Project.

Step 2: In this example we are going to import the data from CSV files for ease of use. Once you click New Project you will get the following screen; click Data Laboratory to import the data.

Step 3: Once the Data Laboratory pane is open, click Import Spreadsheet. First browse to Friends.csv and import it with "As table" set to "Nodes table". Then click Next and Finish.

Once you click the Finish button, the result will look like the following.

Step 4: Now click Import Spreadsheet again. This time browse to Edges.csv and import it with "As table" set to "Edges table". Then click Next and Finish.

The results will be visible once you click on Edges in the Data Table, as shown below:
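For reference, here is a minimal sketch of what the two spreadsheets could contain (the rows are hypothetical; in Gephi's spreadsheet importer the nodes table uses Id and Label columns, while the edges table uses Source and Target columns that refer to node Ids):

```text
Friends.csv (nodes table)      Edges.csv (edges table)
Id,Label                       Source,Target
1,Siva                         1,2
2,Vijay                        2,1
```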

Step 5: Now, when you click the Overview button right below the toolbar, you can see the following network diagram.

The red-highlighted portion tells us that the network has 8 nodes and 15 edges.

Step 6: We can use layouts to make the diagram look a little better: Force Atlas 2 followed by Label Adjust gives a cleaner result.

We will try to get into more details in the next post with a different example.

Social Network Analysis using Gephi

When I took a course on Social Network Analysis, I came across Gephi. This post is a simple attempt at a brief introduction to Social Network Analysis.

What is Social Network Analysis?

According to Wikipedia, in simple terms it is the analysis of social networks. I would rather say it helps to analyze any networked data. The key activity is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other entities. For example, the given figure visually represents relationships between people.

In future posts we will see more on using Gephi for data visualization in Social Network Analysis. Please read through a very good introduction to this topic, and various other articles on the subject, by Martin Grandjean.

What is Gephi?

According to its website, Gephi is an open-source data visualization tool for all kinds of networks and complex systems, dynamic and hierarchical graphs.


Now we will look at a simple example of 8 friends and how they are networked.


Here the nodes are the friends, namely: Siva, Vijay, Gopikrishna, Aditya, Kumar, Ilango, Ramesh and Kannan, each assigned an Id in the nodes table.
Each Id has relationships with other Ids. Let's see how they are linked, based on how closely they are related by introduction:









To handle this with Gephi we can use CSV files: we need two files, which will be imported into Gephi. In my next post we will see how to do this step by step with Gephi.

Step by Step to access a Wikipedia dump using MongoDB and PHP for data analytics

To evaluate how to handle unstructured data using MongoDB, I wrote this post on extracting data from a Wikipedia dump, importing it into MongoDB, and accessing it using the PHP MongoDB client library, with the eventual aim of doing behavior analytics on the data.

Objective: Retrieve the Wikipedia dump, import it into MongoDB and retrieve the data using MongoClient.

Pre-Requisites: We need the following software as pre-requisites for following this post step by step.

Step by Step Procedure: Importing the Tamil Wikipedia dump into MongoDB.

Step 1: First, download the Wikipedia dump from the website. In this case I downloaded the Tamil wiki dump from the following location.

Step 2: Since I wanted to analyze the articles in the Tamil Wikipedia, I downloaded the file tawiki-20131204-pages-articles.xml.bz2, which is 68.5 MB in size.

Step 3: You will also need the appropriate libraries for MongoClient and bzip2 handling. Please download and install them as described below.

Step 3a: MongoClient library installation: see the Amazon S3 link location from which I downloaded the driver.

Note: Extract that zip file and copy the appropriate version of the DLL to the ext folder of your PHP installation. Make sure to add extension=php_mongo-1.4.5-5.4-vc9-x86_64.dll to php.ini (I was running this on a 64-bit OS; choose appropriately).

Step 3b: BZip installation: refer to this link for setting up the BZip library in PHP.


a. BZip2.dll has to be copied from the BZip installation folder to the PHP folder (refer to the picture given below).

b. Make sure to add extension=php_bz2.dll to php.ini, or uncomment it if it already exists.

Step 4: Make sure MongoDB is running, using the command-line interface as shown in the picture below:

Step 5: Enough setup; now it's time to import the data from the wiki dump into MongoDB. Please follow the instructions for executing the PHP import script from the command line. We will use the PHP code by James Linden from the URL provided here.

Download the PHP script and place it under your WAMP/XAMPP folder accordingly. Make sure you change the following settings in the PHP script:

$dsname = 'mongodb://localhost/wp20130708';
$file = 'enwiki-20130708-pages-articles.xml.bz2';
$log = './';

Then go to the command prompt and run the script with the PHP executable.

I'm running the file from the PHPScript folder but executing PHP from its bin folder. A log file is also created when the execution is complete.

Step 6: We will verify the imported data using the mongo client.

Step 7: The following PHP code connects to the MongoDB database which holds the wiki dump.

The following is the code, which I have modified:


<meta http-equiv="content-type" content="text/html;charset=utf-8" />

<?php
// PHP code to look at the Wikipedia data imported from the Wikipedia dump into MongoDB
$dbhost = 'mongodb://localhost/tawiki';
$dbname = 'tawiki';

// Connect to the localhost
$m = new MongoClient($dbhost);

// Select the database "tawiki"
$db = $m->selectDB('tawiki');

// Display which database we are querying
echo ('<b>Collections in </b>' . $dbname . '<br/><hr/>');

// Let's retrieve the data in the collection "page"
$collection = new MongoCollection($db, 'page');

//$Query = array(); // ("username" => "Seesiva");
$Query = array('revision.contributor.username' => 'Seesiva');
$cursor = $collection->find($Query);
$mycount = $cursor->count();

echo ('<b>Titles contributed for the query: </b>' . array_shift($Query) . ' and total results found: ' . strval($mycount) . '<br/><hr/>');

// Retrieve the titles contributed by the user
foreach ($cursor as $doc) {
    echo $doc['title'] . '<br/>';
}
?>
Step 8: Find the results

Cohort Analytics – Solving analytical problems on segmented data

Cohort Analytics

What does the term "Cohort" mean?

"Cohort" means a group of people sharing common characteristics over a certain period of time. For example:


* A cohort of people who were diagnosed with "Diabetes" during the year 2013

* A cohort of students who spent more than 5 days on a Green cause during 2001-2004

* A cohort of Wikipedia users who have stayed more than 5 years and contributed 100 edits every month

Applications or Uses of Cohort Analytics or Cohort Analysis:

* Segmented cohorts help to refine and focus on the problem under question.

* It helps to identify trends or patterns over a period of time.

* It paves the way to analyze customers across different time periods as cohorts.
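As a minimal illustration of this segmentation idea, here is a Python sketch (the customer records are entirely hypothetical) that groups customers into cohorts by their signup month:

```python
from collections import defaultdict

# Hypothetical records: (customer, signup_month, last_active_month)
events = [
    ("a", "2013-01", "2013-02"),
    ("b", "2013-01", "2013-03"),
    ("c", "2013-02", "2013-03"),
]

# Group customers into cohorts by the month they signed up;
# each cohort can then be analyzed separately over time
cohorts = defaultdict(set)
for customer, signup_month, _ in events:
    cohorts[signup_month].add(customer)

print(sorted(cohorts["2013-01"]))   # ['a', 'b']
print(sorted(cohorts["2013-02"]))   # ['c']
```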

Stages of Cohort Analytics:


In the next post we will look in detail about the cohort analysis with an example.

Step by Step Sentiment analysis on Twitter data using R with Airtel Tweets: Part – III

After a lot of difficulties, here is my 3rd post on this topic this weekend. In my first post we saw what sentiment analysis is and what steps are involved in it. In my previous post we saw how to retrieve the tweets and store them in a file, step by step. Now we move on to the sentiment analysis step.

Goal: To do sentiment analysis on Airtel Customer support via Twitter in India.

In this Post: We will load the tweets which were retrieved and stored in the previous post and start the analysis. I'm going to use the simple algorithm used by Jeffrey Breen to determine the scores/moods for a particular brand on Twitter.

We will use the opinion lexicon provided by him, which is primarily based on the papers of Hu and Liu. You can visit their site for a lot of useful information on sentiment analysis. We can determine the positive and negative words in the tweets, based on which the scoring will happen.
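The scoring idea is simple enough to sketch outside R as well. Below is a rough Python illustration, not Jeffrey Breen's code: the tiny word lists are made up for the example, whereas the real analysis uses the full Hu and Liu opinion lexicon.

```python
import re

# Hypothetical miniature lexicons; the real ones come from Hu & Liu's opinion lexicon
pos_words = {"good", "great", "thanks"}
neg_words = {"bad", "poor", "slow"}

def score_sentiment(sentence):
    # Strip non-letters, lower-case, and split into words
    words = re.sub(r"[^a-zA-Z ]", " ", sentence).lower().split()
    # Score = count of positive matches minus count of negative matches
    return sum(w in pos_words for w in words) - sum(w in neg_words for w in words)

print(score_sentiment("Great service, thanks!"))   # 2
print(score_sentiment("Slow and poor support"))    # -2
```

A positive score suggests a positive tweet, a negative score a negative one, and zero a neutral or unmatched tweet.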

Step 1: We will import the CSV file into R using read.csv; you can use summary to display a summary of the data frame.

Step 2: We load the positive and negative word lists, store them locally, and import them using the scan function as given below:

Step 3:

Now we will look at the code for evaluating the sentiment. This has been taken from Jeffrey Breen's post; thanks to Jeffrey for the source code.

Step 4:

We will test the score.sentiment function with some sample data.

In this step we have created a vector called test and added 3 sentences to it, containing different words which may be positive or negative. We pass this test vector to the score.sentiment function along with the pos_words and neg_words we loaded in the previous steps. The function then returns a score for each sentence.

We will also try to understand a little more about this function and what it does:

a. Two libraries are loaded: plyr and stringr, both written by Hadley Wickham, one of the great contributors to R. You can learn more about plyr using this page or tutorial, and get more insight into the split-apply-combine strategy here, the best place to start according to Hadley Wickham. You can think of it by analogy with Google's Map-Reduce, which is used more for parallelism. stringr makes string handling easier.

b. Next, laply is used. You can learn more about what the apply functions do here. In our case we pass the sentences vector to laply. In simple terms, this method takes each tweet, passes it to the scoring function along with the positive and negative word lists, and combines the results.

c. Next, gsub handles the replacements, using gsub(pattern, replacement, x).

d. Then each sentence is converted to lowercase.

e. The sentences are split into words using the split functions, and the appropriate scores are retrieved using the scoring logic.

Step 5: Now we pass the Airtel tweet texts from airteltweetdata$text to the sentiment function to retrieve the scores.

Step 6: We will look at the summary of the scores and their histogram:

The histogram outcome:

It shows that most of the 1499 responses are negative about Airtel.

Disclaimer: Please note that this sample data is analyzed only for educational and learning purposes. It is not intended to target or influence any brand.

Step by Step Sentiment analysis on Twitter data using R with Airtel Tweets: Part – II

In the previous post we saw what sentiment analysis is and what steps are involved in it. In this post we will go through step-by-step instructions for doing sentiment analysis on the micro-blogging site Twitter, with a specific objective in mind. I came across an interesting post by Chetan S on DTH operators' use of social media for providing customer support; it triggered the idea for this post.

Goal: To do sentiment analysis on Airtel Customer support via Twitter in India.

In this Post: We will retrieve the tweets, look at how to access the Twitter API, make the best use of the twitteR R package, and write these tweets to a file.

Important Note:

1. When you would like to use searchTwitter, go to your application's "Settings" tab and select "Read, Write and Access direct messages". Make sure to click the Save button after doing this.

Refer to this link

2. If you get an SSL problem when searching with searchTwitter after the above step, make sure you have enabled RCurl and do the steps outlined here:

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))). Also make sure you have loaded the necessary packages, like ROAuth.

Step 1: Make sure you have done the OAuth authentication with Twitter using the previous post and the steps outlined above. You can also check the loaded libraries with sessionInfo().

Step 2: Load the tweets from the Twitter handle accordingly:

> airtel.tweets = searchTwitter("@airtel_presence", n = 1500)

Now we have loaded the 1499 tweets returned by the Twitter API into airtel.tweets. Next we will save these to a file for future processing.

Step 3: Before we write these tweets to a file, for better understanding we will look at some of the tweets and the data collected so far. head(airtel.tweets) gives the top 6 tweets. Further, we look at the length of the tweet list, what kind of class it is, and how we can access the tweets. Look at the screenshot given below.

Step 4: We will look at some examples of how to access the Twitter data using the twitteR library, by accessing one tweet of the 1499 available. In the example given we have selected the 3rd item from the list and drilled down to the user information: how many friends the user has, how many followers, and so on. These factors are vital to understand, as they determine how content can go viral and impact the image of a particular brand. Now we go to the next step of storing these tweets for further analysis.

Step 5: We will store the tweets we collected in airtel.tweets to a file for future analysis and reference. We are going to convert the list of tweets to rows of data using the apply functions and write them to a file, using the plyr library. plyr allows the user to split a data set apart into smaller subsets, apply methods to the subsets, and combine the results. Please click here for a detailed introduction to plyr. So we convert the list to a data frame to prepare it for writing to a file.

Now the tweets and all the necessary information are available in the tweets.df data frame. You can look at the screenshots below for its summary.

Step 6: Set up the working directory and write the tweets.df data frame to the file airteltweets.csv. You can verify the data in this file using Notepad++ or Excel. In the next post we will look at how to do sentiment analysis with this file's data.