D3 Visualization Step by Step: Part 2

If you have used jQuery, you might already be familiar with the concept of chaining; since D3 uses chaining extensively, that knowledge will help. In addition, you need to learn a little about the path element in SVG.

To build a simple chart we need the following:

a. Data: [607, 788, 274, 117, 140, 89, 158, 664] – the number of people killed in accidents in various cities of Tamil Nadu, India, till 2010
b. X-Axis
c. Y-Axis

The X-axis will contain the cities and the Y-axis the number of people killed in accidents.

Step 1: First, set the height, width and margins of the graph.
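A minimal sketch of this step, following the common d3 v3 margin convention (the pixel values here are illustrative assumptions, not taken from the original example):

var margin = {top: 20, right: 20, bottom: 30, left: 40},
    width = 960 - margin.left - margin.right,
    height = 500 - margin.top - margin.bottom;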

Step 2: Determine the scales. d3.scale.linear() is used for the y-axis: linear scales are the most common scale and a good default choice for mapping a continuous input domain to a continuous output range. The x-axis carries the city names, so it uses an ordinal scale (the x.rangeBand() call in the bar code later depends on this); a sketch of both is given below.

The domain() of each scale is filled from the data (see Step 4), while range() maps it onto the pixel extent of the chart; together these set the values for both axes.
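A minimal sketch of the scale definitions, consistent with the axis and bar code in the rest of this post (d3 v3 API; the 0.1 band padding is an assumed value):

var x = d3.scale.ordinal()
    .rangeRoundBands([0, width], .1);   // cities along the x-axis

var y = d3.scale.linear()
    .range([height, 0]);                // number of people killed along the y-axis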

var xAxis = d3.svg.axis()
    .scale(x)
    .orient("bottom");

var yAxis = d3.svg.axis()
    .scale(y)
    .orient("left");

Step 3: Now select the body element using d3.select() and append an SVG element to it. Width and height are passed as attributes to set the correct dimensions.

var svg = d3.select("body").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

Step 4: As discussed, we now bind the data to the chart using d3.tsv, which loads data from a tab-separated values file:

d3.tsv("data.tsv", type, function(error, data) {
  x.domain(data.map(function(d) { return d.letter; }));
  y.domain([0, d3.max(data, function(d) { return d.frequency; })]);
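The type argument passed to d3.tsv above refers to a row-conversion function that is not shown in this post; a minimal sketch, assuming the letter and frequency column names used in the code, defined outside the callback:

function type(d) {
  d.frequency = +d.frequency;  // coerce the frequency column from string to number
  return d;
}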

Step 5: Add all the components to the SVG area already defined as given below:

svg.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0," + height + ")")
    .call(xAxis);

svg.append("g")
    .attr("class", "y axis")
    .call(yAxis)
  .append("text")
    .attr("transform", "rotate(-90)")
    .attr("y", 6)
    .attr("dy", ".71em")
    .style("text-anchor", "end")
    .text("Frequency");

svg.selectAll(".bar")
    .data(data)
  .enter().append("rect")
    .attr("class", "bar")
    .attr("x", function(d) { return x(d.letter); })
    .attr("width", x.rangeBand())
    .attr("y", function(d) { return y(d.frequency); })
    .attr("height", function(d) { return height - y(d.frequency); });
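Because the axis and bar code above runs inside the d3.tsv callback opened in Step 4, the callback still needs to be closed after the last append (assuming nothing else happens inside it):

});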

The final outcome is shown in the picture below:

[Chart: Road accidents till 2010, Tamil Nadu]

This example was built using the d3 library, HTML and CSS3, served with WAMP. It is certainly complex for somebody just starting out; I will try to evaluate libraries built on top of d3 that can make developing such graphs easier.


Visualization methods and tools: Powerful table

Many times we wonder which table or diagram is appropriate for which situation. I came across a web page that gives a good perspective on, and answers to, such questions.

Periodic table of visualization methods

Courtesy: http://www.visual-literacy.org/periodic_table/periodic_table.html

The page categorizes the content into Data Visualization, Strategy Visualization, Information Visualization, Metaphor Visualization, Concept Visualization and Compound Visualization.

Though these are not hard and fast rules, the table gives a good overview of which method is ideal in which scenario. The page also shows a sample of each visualization when you mouse over its cell.

I believe it would be a very useful tool for many.

Thanks.

First d3.js Visualization, Step by Step: Road Accidents in Cities by Year 2010

Before you start

You need to know what chart you need to draw.

What is D3?

D3.js is a JavaScript library developed by Mike Bostock. It's a great tool for visualizing data.

D3 stands for Data-Driven Documents.

Why would somebody who wants to visualize data need to learn these technologies?

Knowledge of HTML, CSS, JavaScript, JSON, PHP and Apache helps in building data visualizations with d3.js. You can develop your own visualization with just a fundamental knowledge of these; w3schools.com is a good place to get a grip on the fundamentals.

Any d3 visualization would need the following 3 components:

What to present? How to present? Where to present?

What to present: Data (either through JSON or by passing data to JavaScript)

How to present: Structure, using HTML & CSS

Where to present: Layout, specifying how it is to be generated using JavaScript & D3

Use case:

Now let's try to depict the number of people killed in accidents in important cities of the state of Tamil Nadu, India, up to the year 2010.

Chennai City: 607
Chennai Suburban: 788
Coimbatore: 274
Madurai: 117
Salem: 140
Tirunelveli: 89
Trichy City: 158
Villupuram: 858

Source: http://www.tn.gov.in/deptst/RoadAndTransport.pdf
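For reference, this is roughly how the data above could be passed to JavaScript as an array of objects (the field names here are illustrative, not necessarily those used in the jsfiddle):

var accidents = [
  { city: "Chennai City", killed: 607 },
  { city: "Chennai Suburban", killed: 788 },
  { city: "Coimbatore", killed: 274 },
  { city: "Madurai", killed: 117 },
  { city: "Salem", killed: 140 },
  { city: "Tirunelveli", killed: 89 },
  { city: "Trichy City", killed: 158 },
  { city: "Villupuram", killed: 858 }
];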

Refer to the code available at: http://jsfiddle.net/seesiva/qjrZf/3/

We will see how to render the chart in my next post. Request your feedback. Thanks.

CrowdProcess: Possible future for analytics made simple in the making

You might have heard of initiatives like SETI@home, which utilize existing Internet infrastructure to obtain large computing capacity. CrowdProcess, a startup based out of Portugal, is an initiative built on Node.js that helps harness the power of unutilized or underutilized CPU cycles from browsers.

It provides more than 700 GFLOPS of computing capability, with parallel computing abilities, and with ease. Though the technology is in beta, the CrowdProcess team is very supportive and helpful in providing the necessary assistance. I believe that with the evolving trends in big data, such initiatives are most welcome and will attract a lot of people to harness existing resources. It is being termed a browser-powered distributed computing system.

Though the system is new, I have tried some very naïve JavaScript examples with JSON data myself, and it looks promising. The crowdprocess-cli I tried works very well in a Linux environment, though I ran into some issues on Windows.

The community there is growing; follow it, and your contributions will help make the system better.

Market Basket Analysis Retail Foodmart Example: Step by step using R

This post is a small step-by-step implementation of Market Basket Analysis using the Apriori algorithm in R, with a small dataset for easier understanding. It also shows how simply R can be used for such purposes.

I've turned the foodmart dataset into a transaction set by combining Time_Id and Customer_Id into a composite key. This gives a unique transaction id, created as Trans_ID, and I have incorporated the product name for easier understanding, in a table named POS_Transcations. I exported the data from this table as RetailFoodMartData.csv, which has 86,829 records. For the sake of simplicity and quick understanding, I copied a few transactions, limited the number of rows to 105, and re-executed the whole exercise with RetailFoodMartDataTest.csv. The final results shown are the output of RetailFoodMartDataTest.csv.

Before we convert the data into transactions for use with the Apriori algorithm, we need to make sure there are no duplicates in the vector or data.frame; otherwise you will get an error like "cannot coerce list with transactions with duplicated items". So please remove duplicates from the CSV source file (Data -> Remove Duplicates in Excel) before you import the data into R.

Hope this would suffice for this exercise.

I’m using R version 3.0.1 for the analysis.

Data Preparation:

Step 1: Import the exported data into the R environment. If you would like to know how to import, please refer to my blog post here.

Step 2: The outcomes of the import and its summary are shown below; you can view the first few records using head().
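A minimal sketch of Steps 1 and 2, assuming the exported CSV file name mentioned above and the RetailPosData object name used in Step 3:

RetailPosData <- read.csv("RetailFoodMartDataTest.csv")   # import the exported CSV
head(RetailPosData)      # first few records
summary(RetailPosData)   # summary of the imported data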

Step 3: In the above screenshot you can see that the first 6 items belong to the same transaction; our objective now is to group or aggregate the items by transaction id. We can do that using AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id). This aggregates the product names by transaction; in the example shown below, transaction ID 396 has 3 products.
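In code, Step 3 amounts to the following; indexing the resulting list by a transaction id (396 is the one mentioned above) shows the products grouped under it:

AggPosData <- split(RetailPosData$ProductName, RetailPosData$Trans_Id)
AggPosData[["396"]]   # the 3 products belonging to transaction 396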

Implementation of Association Rules Algorithm:

“Mining frequent item sets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases.” As the next step we would need to load the arules library to the RConsole.

Step 4: When you try to load the arules library, if the package does not exist you will see an error as shown in the picture; install the package and then load the library.
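A quick sketch of installing and loading the package:

install.packages("arules")   # only needed the first time
library(arules)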

Step 5: Now we use the aggregate created with split() in Step 3. We coerce it into transactions so the Apriori algorithm can process the data: Txns <- as(AggPosData, "transactions").

Now we will quickly review the Apriori algorithm implementation in R, with a picture that shows its process in a simplified manner:


Courtesy: http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput499/slides/Lect10/img054.jpg

Result of summary(Txns)


In this example the summary describes the transactions as an itemMatrix; this will be the input to the Apriori algorithm. Here the most frequent item is Atomic Bubble Gum, with 6 occurrences.

Step 6: Now we will run the algorithm using the following statement:

Rules <- apriori(Txns, parameter = list(supp = 0.05, conf = 0.4, target = "rules", minlen = 2))

The results obtained above suggest that if a customer buys Just Right Canned Yams, there is a 100% likelihood that he will also buy Atomic Bubble Gum; similarly, if a customer purchases CDR Hot Chocolate, there is a possibility that he will buy either "Just Right Large Canned Shrimp" or "Atomic Bubble Gum". Confidence refers to the likelihood of the purchase, and support refers to the percentage of transactions in which the items are involved.

Step 7: Now we decrease the confidence level to 0.2 and look at the results given below; the number of rules generated has increased. You can inspect the rules using inspect(Rules) and look at a specific rule using inspect(Rules[1]):
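A sketch of Step 7: re-run apriori with the lower confidence threshold and inspect the rules:

Rules <- apriori(Txns, parameter = list(supp = 0.05, conf = 0.2, target = "rules", minlen = 2))
inspect(Rules)      # list all generated rules
inspect(Rules[1])   # look at the first rule alone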

Step 8: Now we visualize the top 5 items by frequency using the following statement: itemFrequencyPlot(Txns, topN = 5)


Good references are available which have more steps in detail:

http://snowplowanalytics.com/analytics/catalog-analytics/market-basket-analysis-identifying-products-that-sell-well-together.html

http://prdeepakbabu.wordpress.com/2010/11/13/market-basket-analysisassociation-rule-mining-using-r-package-arules/

http://www.eecs.qmul.ac.uk/~christof/html/courses/ml4dm/week10-association-4pp.pdf

Though this is considered a "poor man's recommendation engine", it is a very useful one. In my next post we will continue by looking at how to do this kind of analysis on a large volume of data.

Market Basket Analysis with Hadoop: Importing mysql data to Hive using SQOOP

We have an existing data warehouse in MySQL, from which we will need the following tables: the product table and the sales fact tables for the years 1997 and 1998. We will import these into HDFS for further analysis using Hive. Please go through my previous blog post to understand how to establish connectivity with MySQL using Sqoop.

We can start import using the following statement:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root

  • Now you can see that it has imported the data to HDFS in 50.2663 seconds, at 3.0000 KB/sec. If you issue the command hadoop dfs -ls, it will show an item added: /user/hduser/product

A subsequent query with hadoop dfs -ls /user/hduser/product reveals the following:

Since we will use Hive to analyze the data, we will import the data again into Hive using the --hive-import option; if we do that, the following sequence of things happens:

  1. First, the data is imported to HDFS.
  2. Sqoop generates Hive scripts to load the data from HDFS into Hive.

So we need to remove the product folder already imported to HDFS through Sqoop, as Sqoop will find that the folder exists while trying to import into Hive. We remove it using the following statement:

hadoop dfs -rmr /user/hduser/product

Now we will import the data using Sqoop using the hive option:

sqoop import --connect jdbc:mysql://192.168.1.10/foodmart --table product --username root --hive-import

Once the import is complete you will see something like the below:

Now we will go ahead and check the data in Hive using show tables and describe product:
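Checked at the Hive prompt, that amounts to something like:

hive> show tables;
hive> describe product;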

In my next post we will import the remaining table to be used for market basket analysis and start querying with hive.

Why move data from RDBMS or existing data warehouse to HDFS?

In continuation of my previous post, many would have questions like the following:

  1. When there are traditional ways doing Market basket analysis with Cubes with Enterprise Data Warehouse (EDW) why do we need to adopt this route of moving data to HDFS?
  2. What kind of benefits will someone get by moving this RDBMS data to HDFS?
  3. Will it provide any cost savings?

This post is merely an analysis of possible use cases that could complement your data warehousing and analytics strategy. With the big hype around big data and related technologies, it's important to understand what to use and when. Here are good reasons for using Hadoop to complement the data warehouse:

  1. Using tools like Sqoop to move the data to an HDFS infrastructure provides the following benefits:
    1. Storage of extremely high-volume data with the help of the Hadoop infrastructure.
    2. Accelerated data movement via nightly batches with the help of MapReduce tasks.
    3. Automatic and redundant backup thanks to HDFS's naturally fault-tolerant data nodes.
    4. Low-cost commodity hardware for scalability.
  2. Moving structured data to HDFS makes it possible to analyze it in relation to unstructured or semi-structured data such as tweets, blogs, etc.
  3. It is not necessary to model the data up front, as the schema can be applied while the data is being read.
  4. It also gives you the ability to do quick exploratory analytics before moving data to the warehouse for final analysis.

You can look at the following papers for more information and detailed understanding of the same:

http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Using_Cloudera_to_Improve_Data_Processing_Whitepaper.pdf

http://www.b-eye-network.com/blogs/eckerson/archives/2013/02/what_is_unique.php

http://www.teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which/?type=WP