This is a continuation of this blog post, where I tried to find the most active member of a WhatsApp group. Interestingly, it turned out to be me, so I decided to find the most frequent words I used. The first objective was to filter out only the messages typed by me, and grep was my saviour here.
$ cat chat.txt | grep Sreedish
19 Apr 11:32 pm - Sreedish: how long is the course duration
19 Apr 11:32 pm - Sreedish: and timing of classes?
19 Apr 11:33 pm - Sreedish: monday to sunday ?
But this also matched some unwanted lines where other members of the group mentioned my name, like this:
16 Apr 10:37 pm - Aravind S Chennai: Sreedish is driving
The next objective was to get only the messages typed by me, which needed some extra parsing of the text. I made use of the fact that the sender's name is always enclosed between a hyphen and a colon after the timestamp, like this: "- Sreedish:".
$ cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{if($1 ~ "Sreedish")print $2}'
planning to attend regularly?
how long is the duration?
So splitting each message by the hyphen and then the colon, and adding an "if" check for my name, gave exactly the result I wanted. I verified it by printing $1 instead of $2 inside the if block.
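The two-stage split can be sanity-checked on a single line. The chat line below is made up for illustration; it just follows the same "date time - Sender: message" format:

```shell
# A made-up chat line in the "date time - Sender: message" format
echo '19 Apr 11:32 pm - Sreedish: monday to sunday ?' |
  awk -F '-' '{print $2}' |                      # keep what follows the hyphen: " Sreedish: monday to sunday ?"
  awk -F ':' '{if($1 ~ "Sreedish") print $2}'    # keep the message body after the colon
```

One caveat: a message that itself contains a hyphen or a colon would be truncated by this split, so it is a rough heuristic for this chat export rather than a robust parser.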
The next task was to tokenise the sentences into words so that I could count the occurrences. I had to google how this can be done, and came across the translate command in Unix: tr. I had seen people using it, but I had never used it myself. A quick read through the man page of "tr" gave me an idea of how powerful this utility is: "tr" copies the standard input to the standard output with substitution or deletion of selected characters.
$ cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{if($1 ~ "Sreedish")print $2}' | tr -cs '[:alnum:]' '\n'
The good thing about this command was that it also stripped out the special characters and passed only alphanumeric tokens into the analysis. All that remained was to make a count, and here we go.
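The counting itself is the classic sort-then-count pattern, shown here on a tiny made-up token list (the sample words are illustrative, not from the actual chat):

```shell
# Count occurrences of each token: sort groups duplicates together,
# uniq -c prefixes each distinct line with its count,
# and the final sort -n orders the result by frequency.
printf 'is\nit\nis\nmonday\nit\nis\n' | sort | uniq -c | sort -n
```

This prints three lines, least to most frequent: monday once, it twice, is three times.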
But the result still had a problem: 'i' appeared 62 times and 'I' 444 times.
62 i
444 I
At this point I realised the counting was case-sensitive. So one more step was needed: converting all the tokenised words to lower case.
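The extra tr stage can be seen in isolation on a made-up input (again, the sample words are illustrative):

```shell
# 'I', 'i', 'It' and 'IT' collapse into just two lowercase tokens
printf 'I i It IT\n' |
  tr -cs '[:alnum:]' '\n' |        # tokenise: one word per line
  tr '[:upper:]' '[:lower:]' |     # fold everything to lower case
  sort | uniq -c
```

which merges them into "2 i" and "2 it" instead of four separate entries.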
$ cat chat.txt | awk -F '-' '{print $2}' |
awk -F ':' '{if($1 ~ "Sreedish")print $2}' |
tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' |
sort | uniq -c | sort -n
And it gave me "506 i", which was exactly what I wanted. The top words by total count were:
137 will
140 are
149 of
164 it
235 for
236 jith
241 a
242 in
263 the
272 to
295 and
342 is
410 you
506 i
Most of these words are stop words and can easily be filtered out with a "grep -v", which should give more meaningful insight into the frequent words.
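One sketch of that filtering step, using a small hand-written stop-word list (the list, the file name, and the sample tokens are assumptions for illustration, not from the original pipeline):

```shell
# A tiny, hand-picked stop-word list; a real one would be much longer.
printf 'i\nyou\nis\nand\nto\nthe\nin\na\nfor\nit\nof\nare\nwill\n' > stopwords.txt

# grep -v inverts the match, -x matches whole lines only,
# and -f reads the patterns from a file.
printf 'is\njith\nyou\nmonday\n' |   # made-up token stream standing in for the real pipeline
  grep -vxf stopwords.txt |
  sort | uniq -c | sort -n
```

The -x flag matters here: without it, a stop word like "in" would also throw away every word that merely contains it.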
PS: Another interesting thing that can be done on top of this is stemming, and the Porter stemmer is a very easy one to start with.