Wednesday, April 22, 2015

WhatsApp group chat - Data Analysis

I am part of a very active WhatsApp group chat with my college friends, and I got interested in generating some statistics about the chat. So here we go.

First I exported the WhatsApp chat history to my MacBook using the "email conversation" option under the group chat's "More" tab. (I only had chat history starting from February 9, 2015, as I switched to a new phone on that day.) The export arrived in my mailbox, and I downloaded the attached txt file to my laptop. I chose to leave out the media files, since I was interested only in the text messages.

This is what I have now:
 [sreedish.ps@~/Downloads$]cat chat.txt | head  
 9 Feb 10:28 pm - ‪+91 99160 54737‬ created group “Ooty Pattanam”  
 9 Feb 10:28 pm - You were added  
 11 Feb 7:32 pm - Sreedish: I lost all my what's app history  
 11 Feb 7:32 pm - Sreedish: Changed my phone  
 11 Feb 7:51 pm - Nithin Mbt: No backup of mobile possible ?  
 11 Feb 7:54 pm - Sreedish: Gallery and contacts restored  
 11 Feb 7:54 pm - Sreedish: But not chat history  
 11 Feb 8:04 pm - Nithin Mbt: Umm  
 11 Feb 8:09 pm - Anoop Mbt: Which phone?  
 11 Feb 9:35 pm - Sreejith Mohan:   

I used two Unix commands here, cat and head. cat prints the contents of the file to stdout; I piped that into head, which prints only the first ten lines.
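The same preview also works without cat, since head can read a file directly. A minimal sketch, assuming a made-up three-line sample file in place of the real chat.txt:

```shell
# Build a tiny stand-in for chat.txt (hypothetical sample data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
11 Feb 7:32 pm - Sreedish: Changed my phone
11 Feb 7:51 pm - Nithin Mbt: No backup of mobile possible ?
11 Feb 8:04 pm - Nithin Mbt: Umm
EOF

# head reads the file itself; -n sets how many lines to show (default is 10).
preview=$(head -n 2 "$sample")
printf '%s\n' "$preview"

rm -f "$sample"
```

Both forms print the same lines; passing the file name to head just saves one process.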

My first goal was to find the most active member of the group chat. For that I needed to count the messages sent by each member, sort the counts, and pick the person with the most messages. The messages follow a consistent format:

 "date month time - sender:message"  

So in order to get the sender name, I need to extract whatever lies between the "-" (hyphen) and the ":" (colon).

 [sreedish.ps@~/Downloads$]cat chat.txt | awk -F '-' '{print $2}' | head  
  ‪+91 99160 54737‬ created group “Ooty Pattanam”  
  You were added  
  Sreedish: I lost all my what's app history  
  Sreedish: Changed my phone  
  Nithin Mbt: No backup of mobile possible ?  
  Sreedish: Gallery and contacts restored  
  Sreedish: But not chat history  
  Nithin Mbt: Umm  
  Anoop Mbt: Which phone?   

I used the powerful awk, my favourite, to do this. The command was
 cat chat.txt | awk -F '-' '{print $2}' | head  

This means: cat the file to stdout and pipe it to awk. Awk splits each line into fields, and the default delimiter is whitespace. With -F '-' I am telling awk to use the hyphen as the delimiter instead. '{print $2}' means: after splitting on the hyphen, print the second field.
E.g. take the line "11 Feb 7:32 pm - Sreedish: Changed my phone". After splitting with the hyphen as the delimiter:
$1 = 11 Feb 7:32 pm
$2 = Sreedish: Changed my phone
The spaces around the hyphen stay in the fields, so $2 actually starts with a space. I wanted $2 because it contains the sender name. A hyphen inside the message text would cut $2 short, but the sender name comes before it, so that is harmless here. I added head because I didn't want to flood my terminal.
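To see exactly what ends up in each field, the fields can be printed inside brackets. This is a small sketch on one sample line; the " - " separator in the second command is an alternative that is not used elsewhere in this post, shown only to illustrate how the padding can be dropped:

```shell
line='11 Feb 7:32 pm - Sreedish: Changed my phone'

# Split on the hyphen only: the spaces around it stay in the fields.
# The brackets make that padding visible.
hyphen_split=$(printf '%s\n' "$line" | awk -F '-' '{ print "[" $1 "][" $2 "]" }')
printf '%s\n' "$hyphen_split"
# prints: [11 Feb 7:32 pm ][ Sreedish: Changed my phone]

# Split on " - " (space, hyphen, space) to drop the padding.
padded_split=$(printf '%s\n' "$line" | awk -F ' - ' '{ print "[" $1 "][" $2 "]" }')
printf '%s\n' "$padded_split"
# prints: [11 Feb 7:32 pm][Sreedish: Changed my phone]
```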
Next, splitting on the colon, I will strip out only the name of the sender.

 [sreedish.ps@~/Downloads$]cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | head  
  ‪+91 99160 54737‬ created group “Ooty Pattanam”  
  You were added  
  Sreedish  
  Sreedish  
  Nithin Mbt  
  Sreedish  
  Sreedish  
  Nithin Mbt  
  Anoop Mbt  
  Sreejith Mohan  

The command is 
 cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | head  

I piped the output of the first awk into a second awk that uses ':' as the delimiter, and this time I wanted $1, since the sender name comes before the colon. The first two lines are system notices with no colon, so they pass through whole. Now that only the sender names are left, all I have to do is sort them and count.
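The two awk steps can also be folded into one that skips lines which are not messages. The following is a sketch under assumptions, not the pipeline used in this post: it treats a line as a message only if it contains " - " followed somewhere by a colon, and the sample lines (including the continuation line) are made up for illustration. A system notice that happens to contain a colon would still slip through.

```shell
# Hypothetical sample: one system notice, three messages, one continuation line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
9 Feb 10:28 pm - You were added
11 Feb 7:32 pm - Sreedish: Changed my phone
second line of the same message
11 Feb 7:51 pm - Nithin Mbt: No backup of mobile possible ?
11 Feb 8:09 pm - Anoop Mbt: Which phone?
EOF

# NF >= 2 keeps only lines that contain the " - " separator.
# index() finds the first colon after the timestamp; lines without one are skipped.
senders=$(awk -F ' - ' 'NF >= 2 {
    colon = index($2, ":")
    if (colon > 0) print substr($2, 1, colon - 1)
}' "$sample")
printf '%s\n' "$senders"

rm -f "$sample"
```

On this sample only the three sender names are printed; the "You were added" notice and the continuation line are dropped before counting.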

 [sreedish.ps@~/Downloads$]cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | sort | uniq -c | sort -r | head -14  
 3093 Sreedish  
 2285 Aravind S Chennai  
 2104 Kk Bangalore  
 1527 Sreejith Mohan  
  959 Keeru Unname  
  713 ‪KK US  
  688 Rahul Raghavan  
  629 Nithin Mbt  
  428 Rajesh Babu Nit  
  182 Anoop Mbt  
   70 Shekar  
   43 Jyothi  
   37 Suman  
   34 George  

The command used is

 cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | sort | uniq -c | sort -r | head -14  
Here I sorted the names, piped them to uniq -c (-c prefixes each distinct line with its count), and piped that to sort -r (reverse sort) so the highest counts come first. Plain -r gives the right order here because uniq -c pads the counts to the same width; sort -rn would make the numeric ordering explicit.
It turned out that I was the most active member in the group, beating everybody else. :)
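As an alternative to sort | uniq -c, the counting can be done in awk itself with an associative array. A minimal sketch, assuming a short made-up list of sender names rather than the real chat data:

```shell
# Hypothetical sender list, one name per message, as produced by the awk steps.
senders=$(printf '%s\n' 'Sreedish' 'Nithin Mbt' 'Sreedish' 'Anoop Mbt' 'Nithin Mbt' 'Sreedish')

# Tally each name in an awk array, print "count<TAB>name",
# then sort numerically (-n) with the largest count first (-r).
ranking=$(printf '%s\n' "$senders" |
  awk '{ count[$0]++ } END { for (name in count) printf "%d\t%s\n", count[name], name }' |
  sort -rn)
printf '%s\n' "$ranking"
```

This needs only one pass over the names before the final sort, and the numeric sort does not depend on how the counts are padded.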

7 comments:

  1. Cool.... Great attempt and pretty interesting bro :)

  2. Really cool, can you make some software combined with all your observations. Where we only have to insert the text file extracted from Whatsapp and it gives out the desired results.

    1. Try this App :) .It will get you all analysis
      https://play.google.com/store/apps/details?id=com.apps.vsworks.wacanalyzer

  3. Right now I am only doing it manually with the help of Excel

  4. If any of you are interested in data analytics and data science, can you add a WhatsApp group?