What I Learned Today: 2015

Saturday, May 16, 2015

Whatsapp group chat - More Data Analysis

This is a continuation of an earlier post, where I tried to find the most active member of a WhatsApp group. Interestingly, it turned out to be me, so I decided to find the words I use most frequently. The first objective was to filter out only the messages typed by me, and grep was my saviour here.

$cat chat.txt | grep Sreedish  
 19 Apr 11:32 pm - Sreedish: how long is the course duration  
 19 Apr 11:32 pm - Sreedish: and timing of classes?  
 19 Apr 11:33 pm - Sreedish: monday to sunday ?

But this resulted in some unwanted lines where the other members of the group referred to my name, like this:

16 Apr 10:37 pm - Aravind S Chennai: Sreedish is driving

The next objective was to get only the messages typed by me, which needed some extra parsing of the text. I made use of the fact that the sender name always sits between a hyphen and a colon, like this "- Sreedish:", right after the time stamp.

 $cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{if($1 ~ "Sreedish")print $2}'  
 planning to attend regularly?  
 how long is the duration?

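So splitting each line on the hyphen and then on the colon, with an "if" check for my name, gave me exactly the result I wanted. I verified it by printing $1 instead of $2 in the if block. One caveat: a message that itself contains a hyphen or a colon gets cut off at that character, so a few words are lost from the count.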
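The same filtering can be sketched in Java. This is only an illustration of the idea, not a replacement for the one-liner: the class and method names (ChatFilter, messagesFrom) are made up for this example, and it assumes every chat line has the "time - sender: message" shape shown above. It compares the sender name exactly, whereas the awk "~" operator is a pattern match and would also accept any sender whose name merely contains "Sreedish".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChatFilter {
    /** Returns the message bodies whose sender is exactly {@code sender}. */
    static List<String> messagesFrom(List<String> lines, String sender) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            int dash = line.indexOf(" - ");            // end of the time stamp
            if (dash < 0) continue;                    // not a "time - rest" line
            int colon = line.indexOf(": ", dash + 3);  // end of the sender name
            if (colon < 0) continue;                   // system line such as "You were added"
            String name = line.substring(dash + 3, colon);
            if (name.equals(sender)) {
                // Keep the whole message, including any later hyphens or colons.
                out.add(line.substring(colon + 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "19 Apr 11:32 pm - Sreedish: how long is the course duration",
            "16 Apr 10:37 pm - Aravind S Chennai: Sreedish is driving",
            "9 Feb 10:28 pm - You were added");
        System.out.println(messagesFrom(lines, "Sreedish"));
    }
}
```

Because the sender is located by position instead of by splitting the whole line, a message that mentions my name or contains a hyphen is handled without being truncated.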

The next task was to tokenise the sentences into words, so that I could count the occurrences. I had to google how this can be done, and came across the translate command in Unix. I had seen people using it, but had never used it myself. A quick read through the man page of "tr" gave me an idea of how powerful this utility is: "tr" copies the standard input to the standard output with substitution or deletion of selected characters.

 $cat chat.txt | awk -F '-' '{print $2}' |   
 awk -F ':' '{if($1 ~ "Sreedish")print $2}' |   
 tr -cs '[:alnum:]' '\n'

The good thing with this command was that it also stripped out the special characters and fed only alphanumeric tokens into the analysis: -c selects everything that is not alphanumeric, and -s squeezes each run of those characters into a single newline. All that remained was to make a count, and here we go.

But the result still had a problem: 'i' was counted 62 times and 'I' 444 times.

62 i

444 I

At this point I realised that the counting was case sensitive, so one more step was needed: converting all the tokenised words to lower case.

 cat chat.txt | awk -F '-' '{print $2}' |   
 awk -F ':' '{if($1 ~ "Sreedish")print $2}' |   
 tr -cs '[:alnum:]' '\n'|tr '[:upper:]' '[:lower:]' |   
 sort | uniq -c | sort -n

And it gave me "506 i", which was exactly what I wanted. The top words by total count were:


  137 will   
  140 are   
  149 of   
  164 it   
  235 for   
  236 jith   
  241 a   
  242 in   
  263 the   
  272 to   
  295 and   
  342 is   
  410 you   
  506 i
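For comparison, here is a minimal Java sketch of the tokenise, lower-case and count steps. The class name WordCount is made up for this example. It assumes ASCII text: Java's \p{Alnum} matches only ASCII letters and digits, while the tr character class depends on the locale.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class WordCount {
    /** Counts lower-cased alphanumeric tokens across all messages. */
    static Map<String, Integer> count(List<String> messages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String message : messages) {
            // Like tr -cs '[:alnum:]' '\n': each run of other characters is one separator.
            for (String token : message.split("[^\\p{Alnum}]+")) {
                if (token.isEmpty()) continue; // a leading separator yields one empty token
                // Like tr '[:upper:]' '[:lower:]': fold case before counting.
                counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {i=3, think=1, will=2}
        System.out.println(count(Arrays.asList("I think I will", "i will")));
    }
}
```

The map plays the role of "sort | uniq -c": identical tokens end up under one key with their count.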

Most of these words are stop words and can be easily filtered out by a "grep -v", which should give more meaningful insights about the frequent words.
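The stop-word filtering could look roughly like this in Java. The stop-word set below is just the common words from my top list above, hand-picked for illustration; it is not a standard stop-word list, and the names StopWords and withoutStopWords are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class StopWords {
    // Assumption: a tiny hand-picked list, taken from the top words above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "i", "you", "is", "and", "to", "the", "in", "a", "for", "it", "of", "are", "will"));

    /** Returns a copy of the counts without the stop words; the input is left untouched. */
    static Map<String, Integer> withoutStopWords(Map<String, Integer> counts) {
        Map<String, Integer> kept = new LinkedHashMap<>(counts);
        kept.keySet().removeAll(STOP_WORDS); // whole-word match, never a substring match
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("i", 506);
        counts.put("you", 410);
        counts.put("jith", 236);
        System.out.println(withoutStopWords(counts)); // prints {jith=236}
    }
}
```

Matching whole words matters here: a plain substring filter for "i" would also throw away "jith".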

PS: Another interesting thing that can be done on top of this is stemming, and the Porter stemmer is an easy one to start with.

Wednesday, April 22, 2015

Whatsapp group chat - Data Analysis

I am part of a very active whatsapp group chat with my college friends, and I got interested in generating some statistics about the chat. So here we go.

Initially I exported the WhatsApp chat history to my MacBook using the "email conversation" option in the group chat's "More" tab. (I only had chat history starting from 9 February 2015, as I switched to a new phone on that day.) The export arrived in my mailbox, and I downloaded the attached txt file onto my laptop. I opted to leave out the media files, as I was interested only in the text messages.

This is what I have now :

 [sreedish.ps@~/Downloads$]cat chat.txt | head  
 9 Feb 10:28 pm - ‪+91 99160 54737‬ created group “Ooty Pattanam”  
 9 Feb 10:28 pm - You were added  
 11 Feb 7:32 pm - Sreedish: I lost all my what's app history  
 11 Feb 7:32 pm - Sreedish: Changed my phone  
 11 Feb 7:51 pm - Nithin Mbt: No backup of mobile possible ?  
 11 Feb 7:54 pm - Sreedish: Gallery and contacts restored  
 11 Feb 7:54 pm - Sreedish: But not chat history  
 11 Feb 8:04 pm - Nithin Mbt: Umm  
 11 Feb 8:09 pm - Anoop Mbt: Which phone?  
 11 Feb 9:35 pm - Sreejith Mohan:

I used two Unix commands here, cat and head. cat prints the contents of the file to stdout; I piped that into head, which prints only the first ten lines.

My first attempt was to find out who is the most active member of the group chat. For that I needed to count the number of messages typed by each member, sort the counts, and pick the member with the most messages. I observed that the messages follow a nice format:

 "date month time - sender:message"

So in order to get the sender name, I should extract whatever is between the "-" (hyphen) and the ":" (colon).

 [sreedish.ps@~/Downloads$]cat chat.txt | awk -F '-' '{print $2}' | head  
  ‪+91 99160 54737‬ created group “Ooty Pattanam”  
  You were added  
  Sreedish: I lost all my what's app history  
  Sreedish: Changed my phone  
  Nithin Mbt: No backup of mobile possible ?  
  Sreedish: Gallery and contacts restored  
  Sreedish: But not chat history  
  Nithin Mbt: Umm  
  Anoop Mbt: Which phone?

I used the powerful awk, my favourite, to do this. The command was

 cat chat.txt | awk -F '-' '{print $2}' | head

which means: cat the file to stdout and pipe it to awk. Awk splits each line into fields, and the default delimiter is whitespace. By using " -F '-' ", I am telling awk to use the hyphen as the delimiter instead. '{print $2}' means: after splitting on the hyphen, print the second field.

Eg: assume the line is "11 Feb 7:32 pm - Sreedish: Changed my phone". After splitting with the hyphen as delimiter:

$1 = 11 Feb 7:32 pm

$2 = Sreedish: Changed my phone

I wanted $2 because it contains the sender name. I used head because I didn't want to flood my terminal.

Now, based on the colon, I will strip out only the name of the sender.
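To make the field splitting concrete, here is a small Java sketch that mimics what -F and $n do for a single-character delimiter. The helper name field is made up for this example, and it covers only this simple case, not awk's full field-separator rules.

```java
import java.util.regex.Pattern;

final class AwkField {
    /** Returns field n (1-based) of the line split on a literal delimiter, or "" if absent. */
    static String field(String line, String delimiter, int n) {
        // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        return (n >= 1 && n <= fields.length) ? fields[n - 1] : "";
    }

    public static void main(String[] args) {
        String line = "11 Feb 7:32 pm - Sreedish: Changed my phone";
        String second = field(line, "-", 2);
        System.out.println("[" + second + "]");                 // prints [ Sreedish: Changed my phone]
        System.out.println("[" + field(second, ":", 1) + "]");  // prints [ Sreedish]
    }
}
```

The brackets in the output show that the space after the hyphen stays in the field. The sketch also makes one side effect easy to see: a hyphen inside the message starts a third field, so $2 stops at that hyphen.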

 [sreedish.ps@~/Downloads$]cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | head  
  ‪+91 99160 54737‬ created group “Ooty Pattanam”  
  You were added  
  Sreedish  
  Sreedish  
  Nithin Mbt  
  Sreedish  
  Sreedish  
  Nithin Mbt  
  Anoop Mbt  
  Sreejith Mohan

The command is

 cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | head

I piped the output of the first awk into a second awk that uses ':' as the delimiter, and this time I wanted $1, as the sender name precedes the delimiter. Now that I have only the sender names, all I have to do is sort them and count.

 [sreedish.ps@~/Downloads$]cat chat.txt | awk -F '-' '{print $2}' | awk -F ':' '{print $1}' | sort | uniq -c | sort -r | head -14  
 3093 Sreedish  
 2285 Aravind S Chennai  
 2104 Kk Bangalore  
 1527 Sreejith Mohan  
  959 Keeru Unname  
  713 ‪KK US  
  688 Rahul Raghavan  
  629 Nithin Mbt  
  428 Rajesh Babu Nit  
  182 Anoop Mbt  
  70 Shekar  
  43 Jyothi  
  37 Suman  
  34 George

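The command used is the pipeline shown at the prompt above: sort brings identical names next to each other, uniq -c collapses each group into one line prefixed with its count, sort -r puts the largest counts first, and head -14 keeps the top 14.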
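The "sort | uniq -c | sort -r" pipeline can also be sketched in Java. The names SenderCount and countBySender are made up for this example. Unlike the shell version, this sketch skips lines without a colon after the hyphen, so system lines such as "You were added" are not counted as senders.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SenderCount {
    /** Counts messages per sender, most active sender first. */
    static Map<String, Long> countBySender(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            int dash = line.indexOf(" - ");          // end of the time stamp
            if (dash < 0) continue;
            int colon = line.indexOf(':', dash + 3); // end of the sender name
            if (colon < 0) continue;                 // system line, no sender
            counts.merge(line.substring(dash + 3, colon), 1L, Long::sum);
        }
        // Descending by count; ties are broken by name so the order is deterministic.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "9 Feb 10:28 pm - You were added",
            "11 Feb 7:32 pm - Sreedish: Changed my phone",
            "11 Feb 7:51 pm - Nithin Mbt: No backup of mobile possible ?",
            "11 Feb 7:54 pm - Sreedish: But not chat history");
        System.out.println(countBySender(lines)); // prints {Sreedish=2, Nithin Mbt=1}
    }
}
```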

Thursday, April 9, 2015

Java NIO ByteBuffer Experiments

Sunil Kalva and I stumbled upon an exciting use case for Java NIO ByteBuffers. Sunil started experimenting with code, and I started reading up on it. The Java NIO documentation was not that great, but we found an excellent tutorial. Please go through it and understand the ByteBuffer jargon like capacity, position, limit, compact, clear etc. before trying out this example.

Code:

 import java.nio.ByteBuffer;  
 public class test {  
  public static void main(String[] args) {  
   ByteBuffer byteBuffer = ByteBuffer.allocate(10);  
   print(byteBuffer, "ByteBuffer Created");  
   byteBuffer.put("ABCD".getBytes());  
   print(byteBuffer, "After adding data");  
   byteBuffer.flip();  
   print(byteBuffer, "After Read Flip");  
   System.out.println("Reading data from byte buffer");  
   while(byteBuffer.hasRemaining()){  
    System.out.print((char) byteBuffer.get());  
   }  
   //spacer  
   System.out.println();  
   print(byteBuffer, "After Reading ");  
   byteBuffer.rewind();  
   print(byteBuffer, "After rewind");  
   byteBuffer.compact();  
   print(byteBuffer, "After Compact");  
   byteBuffer.put("EFGHIJ".getBytes());  
   print(byteBuffer, "After Adding More");  
   byteBuffer.clear();  
   print(byteBuffer, "After clear");  
   byteBuffer.get(new byte[2]);  
   print(byteBuffer, "After Reading 2 bytes");  
   byteBuffer.compact();  
   print(byteBuffer, "After compact");  
   byteBuffer.put("LM".getBytes());  
   print(byteBuffer, "After Writing 2 bytes");  
  }  
  private static void print(ByteBuffer byteBuffer, String message) {  
   System.out.println("\n========= " + message + " ========= " );  
   System.out.println("Content  = "+new String(byteBuffer.array()));  
   System.out.println("Position = " + byteBuffer.position());  
   System.out.println("Limit   = " + byteBuffer.limit());  
   System.out.println("Remaining = " + byteBuffer.remaining());  
   System.out.println("======================================");  
   System.out.println();  
  }  
 }

Output:

 ========= ByteBuffer Created =========   
 Content  =   
 Position = 0  
 Limit   = 10  
 Remaining = 10  
 ======================================  
 ========= After adding data =========   
 Content  = ABCD  
 Position = 4  
 Limit   = 10  
 Remaining = 6  
 ======================================  
 ========= After Read Flip =========   
 Content  = ABCD  
 Position = 0  
 Limit   = 4  
 Remaining = 4  
 ======================================  
 Reading data from byte buffer  
 ABCD  
 ========= After Reading =========   
 Content  = ABCD  
 Position = 4  
 Limit   = 4  
 Remaining = 0  
 ======================================  
 ========= After rewind =========   
 Content  = ABCD  
 Position = 0  
 Limit   = 4  
 Remaining = 4  
 ======================================  
 ========= After Compact =========   
 Content  = ABCD  
 Position = 4  
 Limit   = 10  
 Remaining = 6  
 ======================================  
 ========= After Adding More =========   
 Content  = ABCDEFGHIJ  
 Position = 10  
 Limit   = 10  
 Remaining = 0  
 ======================================  
 ========= After clear =========   
 Content  = ABCDEFGHIJ  
 Position = 0  
 Limit   = 10  
 Remaining = 10  
 ======================================  
 ========= After Reading 2 bytes =========   
 Content  = ABCDEFGHIJ  
 Position = 2  
 Limit   = 10  
 Remaining = 8  
 ======================================  
 ========= After compact =========   
 Content  = CDEFGHIJIJ  
 Position = 8  
 Limit   = 10  
 Remaining = 2  
 ======================================  
 ========= After Writing 2 bytes =========   
 Content  = CDEFGHIJLM  
 Position = 10  
 Limit   = 10  
 Remaining = 0  
 ======================================

By carefully examining how the position, limit and remaining values change, we were able to get a decent understanding of the internals of a byte buffer during various operations.

Takeaways:

flip is used to switch from write mode to read mode: the limit is set to the current position and the position goes back to 0
rewind is used to re-read: the position goes back to 0 and the limit stays where it is
clear is used to switch from read mode to write mode. It resets the position to 0 and the limit to the capacity so that writing starts over; the old bytes are not erased, they are only overwritten by later writes
compact is used to switch to append mode when we are in read mode. It moves all the unread bytes to the beginning of the byte buffer and sets the position just after them, so new writes append after the unread data. Visualise it like the operating system doing a disk compaction. Everything fragmented comes to the extreme left in a compacted way.
while(byteBuffer.hasRemaining()){byteBuffer.get()} can be used to read the byte buffer exhaustively.
A simple workflow will be like

create;  
 repeat() {  
 put; // write  
 flip; // get ready to read  
 get; //read  
 clear; or compact; //get ready to write  
 }
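The workflow above can be turned into a small runnable sketch. The class name BufferWorkflow, the method pump and its parameters are made up for this example. It assumes ASCII text and that each chunk fits into the free space of the buffer; otherwise put throws a BufferOverflowException.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BufferWorkflow {
    /** Feeds chunks through one buffer, reading at most readSize bytes per pass. */
    static String pump(String[] chunks, int capacity, int readSize) {
        ByteBuffer buffer = ByteBuffer.allocate(capacity);            // create
        StringBuilder out = new StringBuilder();
        for (String chunk : chunks) {
            buffer.put(chunk.getBytes(StandardCharsets.US_ASCII));    // put: write
            buffer.flip();                                            // get ready to read
            for (int i = 0; i < readSize && buffer.hasRemaining(); i++) {
                out.append((char) buffer.get());                      // get: read
            }
            buffer.compact();                // keep the unread bytes, get ready to write
        }
        buffer.flip();                       // final pass: drain whatever is left
        while (buffer.hasRemaining()) {
            out.append((char) buffer.get());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints ABCDEFGHIJ
        System.out.println(pump(new String[] {"ABCD", "EFGH", "IJ"}, 10, 3));
    }
}
```

Each pass reads fewer bytes than were written, so compact is the right call here: it carries the unread bytes over to the next pass. With clear in its place, those leftover bytes would be overwritten by the next put.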

Have fun with ByteBuffer, just like we did.

Wednesday, April 1, 2015

Runtime metrics of Java applications through Metrics by Yammer

Metrics by Yammer provides handy ways to capture runtime metrics and statistics of various applications. It offers various types of measurements like counters, gauges, histograms, timers etc. More on these can be read in the Metrics documentation.

This post is about creating a sample Java application which uses a Metrics counter and monitoring the counter through jconsole.

Maven dependency :

 <dependencies>  
   <dependency>  
     <groupId>io.dropwizard.metrics</groupId>  
     <artifactId>metrics-core</artifactId>  
     <version>3.1.0</version>  
   </dependency>  
 </dependencies>

Java class :

 import com.codahale.metrics.Counter;  
 import com.codahale.metrics.JmxReporter;  
 import com.codahale.metrics.MetricRegistry;  
 import static com.codahale.metrics.MetricRegistry.name;  
 public class YammerTest {  
  static final MetricRegistry metrics = new MetricRegistry();  
  private static final Counter iterations = metrics.counter(name(YammerTest.class,"iterations"));  
  private static JmxReporter reporter = JmxReporter.forRegistry(metrics).build();  
  public static void main(String[] args) {  
   // Counter example: expose the registry over JMX, then count loop iterations  
   reporter.start();  
   while(true){  
    System.out.println("hello world");  
    iterations.inc();  
   }  
  }  
 }

Now run this code in a shell and it will start printing "hello world" continuously to the console. Open another terminal and start "jconsole", and connect to the local process "YammerTest".

Screen shot:

You should be able to see the metrics counter in jconsole under the MBeans tab. This means that the Java application is successfully exposing the metrics counter using JMX, and we can easily ship the counters to any metrics store like Graphite or Ganglia.
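To get a feel for what "exposing a counter using JMX" means, here is a sketch that uses only the JDK's own javax.management API, without the Metrics library. It is not how the Metrics JmxReporter is implemented. The class names and the object name "sketch:type=Counter,name=iterations" are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class JmxCounterSketch {
    /** Management interface: a standard MBean interface is public and named <Class>MBean. */
    public interface IterationsMBean {
        long getCount(); // shows up as the read-only attribute "Count"
    }

    /** A thread-safe counter that can be registered as an MBean. */
    public static final class Iterations implements IterationsMBean {
        private final AtomicLong count = new AtomicLong();
        public void inc() { count.incrementAndGet(); }
        @Override public long getCount() { return count.get(); }
    }

    /** Registers a fresh counter with the platform MBean server under the given name. */
    static Iterations register(String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            Iterations counter = new Iterations();
            server.registerMBean(counter, new ObjectName(objectName));
            return counter;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the "Count" attribute back through JMX, the way a JMX client would. */
    static long readCount(String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return (Long) server.getAttribute(new ObjectName(objectName), "Count");
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = "sketch:type=Counter,name=iterations"; // hypothetical object name
        Iterations iterations = register(name);
        for (int i = 0; i < 3; i++) {
            iterations.inc();
        }
        System.out.println(readCount(name)); // prints 3
    }
}
```

The sketch exits right away, so there is nothing to look at in jconsole; keeping the process alive, for example with the endless loop from YammerTest, would make the bean visible under the MBeans tab.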
