Text Mining: Google n-Gram Viewer & the word “Tamil”

What is the n-Gram Viewer?

According to Wikipedia, the “n-Gram viewer is a phrase-usage graphing tool which charts the yearly count of selected n-grams (letter combinations) or words and phrases, as found in over 5.2 million books digitized by Google Inc (up to 2008).”
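To make the idea of a "yearly count of n-grams" concrete, here is a minimal sketch in Python. It splits text into word n-grams and counts how often a phrase appears per year. The two "books" are made-up toy data, and the function names are my own; this is not how Google computes its charts, only an illustration of the concept.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all word n-grams in `text` as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def yearly_counts(books, phrase):
    """Count how often `phrase` occurs in each year.

    `books` is an iterable of (year, text) pairs (toy data here).
    """
    target = tuple(phrase.lower().split())
    counts = Counter()
    for year, text in books:
        counts[year] += word_ngrams(text, len(target)).count(target)
    return dict(counts)

# Hypothetical sample texts, invented for illustration only.
books = [
    (1854, "a tamil and english dictionary of the tamil language"),
    (1900, "grammar of the tamil language"),
]

print(yearly_counts(books, "Tamil"))           # 1-gram
print(yearly_counts(books, "tamil language"))  # 2-gram
```

A single word such as "Tamil" is a 1-gram, and "tamil language" is a 2-gram; the same function handles both because the phrase length decides n.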

The Reference URL:

http://books.google.com/ngrams/

My Experiment:

http://books.google.com/ngrams/graph?content=Tamil&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=
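If you want to try other words, the query URL above can be assembled from its parts. The sketch below only rebuilds the same URL with Python's standard library. The parameter names are read straight off the URL above; their exact meaning (for example which corpus the number 15 selects) and whether Google still accepts this form are assumptions, since this is not taken from official documentation. The helper name ngram_url is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the experiment URL above.
BASE_URL = "http://books.google.com/ngrams/graph"

def ngram_url(content, year_start=1800, year_end=2000, corpus=15, smoothing=3):
    """Build an n-Gram Viewer URL using the parameters seen in the post."""
    params = {
        "content": content,        # word or phrase to chart
        "year_start": year_start,  # first year on the x-axis
        "year_end": year_end,      # last year on the x-axis
        "corpus": corpus,          # corpus id as it appears in the URL (meaning assumed)
        "smoothing": smoothing,    # smoothing value as it appears in the URL
        "share": "",               # empty in the original URL
    }
    return BASE_URL + "?" + urlencode(params)

print(ngram_url("Tamil"))
```

Calling ngram_url("Tamil") reproduces the experiment URL; changing the first argument or the year range gives a link for a different query.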

I was curious to see how often the word “Tamil” appears in the books covered by the Google digitization project.

Figure 1: Courtesy Google n-Grams

 

Some scripts from a book dated 1854, courtesy of the Google digitization project.

 

Figure 2: Book digitized by Google, from Jaffna, Sri Lanka (Tamil to English)

 

I am astonished by the way the digitization has been done and the way the text mining works. Awesome. Try your hand at it too.

Text Mining: Intro, Tools and References

What is it?

In simple terms, text mining means retrieving high-quality information from text for analysis.
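As a tiny illustration of pulling information out of raw text, the sketch below tokenizes a sentence, drops a few common words, and lists the most frequent remaining terms. The stop-word list and the sample sentence are invented for this example, and real projects would normally use one of the tools listed further down rather than hand-rolled code.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "was"}

def top_terms(text, k=3):
    """Return the k most frequent non-stop-word terms with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())     # crude tokenizer
    terms = [t for t in tokens if t not in STOPWORDS]  # drop common words
    return Counter(terms).most_common(k)

# Made-up sample text, e.g. a line from an insurance claim.
sample = ("The claim was filed late. The claim is for water damage, "
          "and the damage is large.")

print(top_terms(sample, 2))
```

Even this crude count surfaces "claim" and "damage" as the main subjects of the sample, which is the kind of signal the use cases below build on.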

Where can it be used?

  1. Analysis of emails, messages, etc.
  2. Analysis of open-ended surveys
  3. Analysis of claims for fraud detection
  4. Investigation by crawling
  5. Spam filtering
  6. Labeling for Machine learning
  7. Recommendation engines

Various Stages of Text Mining:

Good tools for Text Mining (all free):

  • R Programming (refer to the tm package)
  • Gensim (Python library for analyzing plain text)
  • GATE (open-source library for text processing, about 15 years old)

Good References:

Where to get started: http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/

http://www.statsoft.com/textbook/text-mining/

http://rapid-i.com/component/option,com_myblog/show,Great-Video-Series-about-Text-Mining.html/Itemid,172/