What is the n-Gram Viewer?
According to Wikipedia, the “n-Gram viewer is a Phrase-usage graphing tool which charts the yearly count of selected n-grams (letter combinations)[n] or words and phrases, as found in over 5.2 million books digitized by Google Inc (up to 2008).”
The Reference URL:
I was curious to see how well Tamil is represented in the Google digitization project, so I tried the following.
Figure 1: Courtesy Google n-Grams
Here are some scripts from a book dating to 1854, courtesy of the Google digitization project.
Figure 2: Book digitized by Google from Jaffna, Sri Lanka (Tamil to English)
I am astonished by the way the digitization has been done and by how the text mining works. Awesome. Try your hand at it too.
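To make the idea of an n-gram and its "yearly count" concrete, here is a minimal Python sketch. It is not how Google builds its charts; the function names `word_ngrams` and `yearly_counts` and the toy corpus are made up for illustration, and it assumes words are separated by whitespace.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of `text` as a list of tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_counts(corpus, phrase):
    """Count how often `phrase` occurs in the texts of each year.

    `corpus` maps a publication year to a list of texts (hypothetical data).
    """
    target = tuple(phrase.lower().split())
    counts = {}
    for year, texts in corpus.items():
        total = Counter()
        for text in texts:
            total.update(word_ngrams(text, len(target)))
        counts[year] = total[target]  # Counter returns 0 for unseen n-grams
    return counts


# Made-up toy corpus, only for illustration.
corpus = {
    1854: ["the tamil language is old", "a tamil language dictionary"],
    1900: ["the english language"],
}

print(word_ngrams("the tamil language", 2))
print(yearly_counts(corpus, "tamil language"))
```

Plotting those per-year numbers for a phrase gives the kind of curve the viewer draws, only on a vastly smaller scale.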
What is text mining?
In simple terms, text mining is the retrieval of high-quality information from text for analysis.
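As a rough illustration of pulling information out of raw text, the sketch below tokenizes a few sentences, drops common stop words, and counts the remaining terms. The stop-word list, the function names `tokenize` and `top_terms`, and the sample answers are all assumptions made for this example, and the tokenizer is an English-only simplification.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    English-only simplification: other scripts would need a different pattern.
    """
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, k=3):
    """Return the k most frequent non-stop-word terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts.most_common(k)


# Made-up survey answers, only for illustration.
answers = [
    "The delivery was late and the package was damaged.",
    "Late delivery, but the support team was helpful.",
    "Support resolved the damaged package claim.",
]

print(top_terms(answers, k=5))
```

Even this crude count surfaces the recurring themes in the answers (late delivery, damaged package, support), which is the kind of signal the tools listed further down extract with far more care.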
Where can it be used?
- Analysis of emails, messages, etc.
- Analysis of open-ended surveys
- Analysis of claims for fraud detection
- Investigation by crawling
- Spam filtering
- Labeling for machine learning
- Recommendation engines
Various Stages of Text Mining:
Good tools for Text Mining (free):
- R Programming (refer to the tm package)
- Gensim (Python library for analyzing plain text)
- GATE (open-source library for text processing, 15 years old)
Where to get started: http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/