This is a continuation of this blog post, where I tried to find the most active member of a WhatsApp group. Interestingly, it turned out to be me, so I decided to find the most frequent words I used. The first objective was to filter out only the messages typed by me, and grep was my saviour here.
$ cat chat.txt | grep Sreedish
19 Apr 11:32 pm - Sreedish: how long is the course duration
19 Apr 11:32 pm - Sreedish: and timing of classes?
19 Apr 11:33 pm - Sreedish: monday to sunday ?
But this also matched some unwanted lines where other members of the group mentioned my name, like this:
16 Apr 10:37 pm - Aravind S Chennai: Sreedish is driving
The next objective was to get only the messages typed by me, which needed some extra parsing of the text. I made use of the fact that the sender's name is always enclosed between a hyphen and a colon after the timestamp, like this: "- Sreedish:".
$ cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{if($1 ~ "Sreedish")print $2}'
planning to attend regularly?
how long is the duration?
So splitting each message by the hyphen and then the colon, and adding an "if" check for my name, gave exactly the result I wanted. I verified it by printing $1 instead of $2 inside the if block.
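The two-stage split can be sanity-checked on a single line. The chat line below is made up for illustration; it just follows the same "date time - Sender: message" format:

```shell
# A made-up chat line in the "date time - Sender: message" format
echo '19 Apr 11:32 pm - Sreedish: monday to sunday ?' |
  awk -F '-' '{print $2}' |                      # keep what follows the hyphen: " Sreedish: monday to sunday ?"
  awk -F ':' '{if($1 ~ "Sreedish") print $2}'    # keep the message body after the colon
```

One caveat: a message that itself contains a hyphen or a colon would be truncated by this split, so it is a rough heuristic for this chat export rather than a robust parser.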
The next task was to tokenise the sentences into words so that I could count the occurrences. I had to google how this can be done, and came across the translate command in Unix: tr. I had seen people using it, but I had never used it myself. A quick read through the man page of "tr" gave me an idea of how powerful this utility is: "tr" copies the standard input to the standard output with substitution or deletion of selected characters.
$ cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{if($1 ~ "Sreedish")print $2}' | tr -cs '[:alnum:]' '\n'
The good thing about this command was that it also stripped out the special characters and passed only alphanumeric tokens into the analysis. All that remained was to make a count, and here we go.
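The counting itself is the classic sort-then-count pattern, shown here on a tiny made-up token list (the sample words are illustrative, not from the actual chat):

```shell
# Count occurrences of each token: sort groups duplicates together,
# uniq -c prefixes each distinct line with its count,
# and the final sort -n orders the result by frequency.
printf 'is\nit\nis\nmonday\nit\nis\n' | sort | uniq -c | sort -n
```

This prints three lines, least to most frequent: monday once, it twice, is three times.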
But the result still had a problem: 'i' appeared 62 times and 'I' 444 times.
62 i
444 I
At this point I realised the counting was case-sensitive. So one more step was needed: converting all the tokenised words to lower case.
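The extra tr stage can be seen in isolation on a made-up input (again, the sample words are illustrative):

```shell
# 'I', 'i', 'It' and 'IT' collapse into just two lowercase tokens
printf 'I i It IT\n' |
  tr -cs '[:alnum:]' '\n' |        # tokenise: one word per line
  tr '[:upper:]' '[:lower:]' |     # fold everything to lower case
  sort | uniq -c
```

which merges them into "2 i" and "2 it" instead of four separate entries.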
$ cat chat.txt | awk -F '-' '{print $2}' |
awk -F ':' '{if($1 ~ "Sreedish")print $2}' |
tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' |
sort | uniq -c | sort -n
And it gave me "506 i", which was exactly what I wanted. The top words by total count were:
137 will
140 are
149 of
164 it
235 for
236 jith
241 a
242 in
263 the
272 to
295 and
342 is
410 you
506 i
Most of these words are stop words and can easily be filtered out with a "grep -v", which should give more meaningful insight into the frequent words.
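One sketch of that filtering step, using a small hand-written stop-word list (the list, the file name, and the sample tokens are assumptions for illustration, not from the original pipeline):

```shell
# A tiny, hand-picked stop-word list; a real one would be much longer.
printf 'i\nyou\nis\nand\nto\nthe\nin\na\nfor\nit\nof\nare\nwill\n' > stopwords.txt

# grep -v inverts the match, -x matches whole lines only,
# and -f reads the patterns from a file.
printf 'is\njith\nyou\nmonday\n' |   # made-up token stream standing in for the real pipeline
  grep -vxf stopwords.txt |
  sort | uniq -c | sort -n
```

The -x flag matters here: without it, a stop word like "in" would also throw away every word that merely contains it.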
PS: Another interesting thing that can be done on top of this is stemming, and the Porter stemmer is a very easy one to start with.