Saturday, November 19, 2016

Where are my logs


Every once in a while you get a request to investigate a weird state in the system, or some other unexplained behavior that doesn't match any known definition => a bug :-). So the first thing to do, even before trying to reproduce it, is to glance at the logs and try to locate the root cause. Now if you have a logging infrastructure with a long retention period you are off the hook, but as it happens, you may have been too slim on developers or budget and decided to get a 3rd party solution, which, unless you pay top dollar, will limit your retention period. And of course the best bugs always show up a couple of days after the last day of retention that you have. So what can you do?
Well, one way to go and still keep it slim is to configure a daily dump of the logs to S3 (check whether your 3rd party logging provider allows that). Once that's configured you are off the hook, right? Well, not yet. You can locate the log file and you can fetch it, but how are you going to search in it? If you have a small to medium app you are looking at ~10GB per day, so what now?
A tool that I find very useful is glogg. It has no issue eating large log files like this one.

  1. Configure daily dump of logs to S3
  2. Use glogg to search in large files.
  3. Find the root cause
  4. One less bug in the system
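Before opening glogg, a scripted first pass can help narrow down the time window. A minimal sketch of a streaming search in Ruby (the helper name is mine) — memory stays flat even on a multi-GB dump because the file is never loaded whole:

```ruby
# Stream a huge log file line by line and collect matching lines
# with their line numbers; File.foreach reads lazily, so a ~10GB
# dump never has to fit in memory.
def grep_log(path, pattern)
  matches = []
  File.foreach(path).with_index(1) do |line, lineno|
    matches << [lineno, line.chomp] if line.match?(pattern)
  end
  matches
end
```

Once the interesting time range is found, glogg remains the better tool for the interactive digging.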


Wednesday, November 16, 2016

DB heat map - learn how data is accessed the visual way


I recently tried to get a picture of the way my application is using the DB: which tables are accessed, and when they are accessed.

Many things can be learned by using pg extras (if you are on Heroku), or you can take the source code and run the queries yourself without installing the add-on. You can get cache and index hit ratios, table sizes, index sizes, unused indices, long-running queries and much more. But with all that useful information I still didn't get an understanding of how and when my app is using the DB.

So I had an idea: I would like to generate a heat map that shows me, for each and every table, how many times it is accessed. The brighter the color, the more the table is accessed. With such an image I could immediately understand, or at least get an idea of, my app's access pattern. I could know which tables are accessed the most in a period of time, which tables are accessed together and which tables are almost never accessed.

So in order to do that I have to get the data from somewhere. Two options come to mind:
  1. Implement a counter at the ActiveRecord level that does the job. But that would take time, have performance implications and have to be deployed.
  2. Use the very verbose Rails log that writes each and every query made.

I chose the second option for the obvious reasons. Now that I have an approach I need to crunch the data.

This is the plan:
  1. Filter the logs to only keep the DB data.
  2. Analyze every log line and assign it to a specific interval. Each interval holds an array of counters; each element in the array corresponds to a specific table.
  3. Take the result data structure and convert it to csv format.
  4. Plot the heat map using R.
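Steps 1–3 of the plan can be sketched in plain Ruby. The log-line regex below is only a guess at the Rails SQL log format, and the interval size and method names are mine:

```ruby
require 'csv'
require 'time'

# Assumed log shape, roughly: I, [2016-11-15T10:03:12 ...] ... SELECT ... FROM "users" ...
LOG_LINE = /\[(?<time>[^\]\s]+)[^\]]*\].*FROM "(?<table>\w+)"/

INTERVAL = 3600 # bucket size in seconds (one hour)

def heat_counts(lines)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  lines.each do |line|
    m = LOG_LINE.match(line) or next              # step 1: keep only the DB lines
    bucket = Time.parse(m[:time]).to_i / INTERVAL # step 2: assign to an interval
    counts[bucket][m[:table]] += 1                #         and bump that table's counter
  end
  counts
end

def to_csv(counts)
  tables = counts.values.flat_map(&:keys).uniq.sort
  CSV.generate do |csv|                           # step 3: one row per interval,
    csv << ['interval'] + tables                  #         one column per table
    counts.sort.each do |bucket, row|
      csv << [bucket] + tables.map { |t| row[t] }
    end
  end
end
```

The resulting CSV is exactly the matrix shape that R's heat map functions expect: intervals down the rows, tables across the columns.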

The result: (stripped of real table data)

As you can see, it is very easy to notice that some tables are accessed all the time, while others are hardly accessed at all. We can even get an idea of the correlation between tables that are accessed at the same time. Very insightful.



Friday, November 11, 2016

Python, Scala != Java


This one is about readable code, or "I want coffee to be something else".

The other day I was looking at the introductory code examples of Apache Spark.
The code examples are written in Python, Scala and Java. As a Rubyist I very much appreciate clean, elegant and joyful code. The first two are very easy on the eye, but Java... well, I believe that you pretty much know what's going on there, right? Nothing but joy to the eye: long, cumbersome syntax that makes you jump back to the Scala example at lightning speed.



Thursday, March 10, 2016

Sidekiq workers memory bloat


Recently I had to deal with a situation where my Sidekiq workers started consuming too much memory. It didn't seem to be a leak, but rather a large amount of memory being consumed every once in a while.

So first of all, I used New Relic — it is so awesome. You can see the memory jumps there and pinpoint the specific worker that caused the jump. From there you can go through the worker classes that were run and try to figure out what caused the memory jump.

This method helped me a lot in getting a good idea about what is going on, but I wanted to narrow down the suspects further. To do that I wrote a little Sidekiq server middleware that logs every time the memory consumption of a job exceeds a certain threshold. Because several threads are running at the 'same' time as the measurement, there may be a situation where I don't catch fast jobs that consume lots of memory. Personally I think that is less likely, because fast methods usually consume less memory. The other way of dealing with it is to set Sidekiq to a single thread (but that is not the best thing in production).

After using the middleware I was able to see that only a few types of jobs (worker classes) caused the memory jumps, and after analyzing those jobs and their inputs I was able to detect the scenario that caused the memory bloat.

The server middleware snippet:
This snippet makes use of the NewRelic gem in order to take the memory samples.

class MemoryBloatLoggerMiddelware
  def call(worker, job, queue)
    threshold = 50 # should come from an env var
    # assumed: NewRelic's platform-specific memory sampler, given the
    # .sampler.get_sample calls below (the sample is in MB)
    memory_sampler = NewRelic::Agent::Samplers::MemorySampler.new
    before = memory_sampler.sampler.get_sample.to_i
    yield
    after = memory_sampler.sampler.get_sample.to_i
    memory = after - before
    puts "Memory bloat detected threshold:#{threshold} bloat: #{memory} #{job}" if memory > threshold
  end
end
Now all you need to do is add this to the Sidekiq server middleware chain.
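Sidekiq server middleware is registered in an initializer via `configure_server`. A minimal sketch, assuming the class above is loadable there:

```ruby
# e.g. config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MemoryBloatLoggerMiddelware
  end
end
```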



Monday, February 8, 2016

Simple-events gem


My simple-events gem, which is still a work in progress, is on hold for now.
The reason I stopped the development is some higher priorities that I have, including splitting the monolith (AKA micro-services architecture) and the introduction of node technology as well.
But, just to set things straight, I have a real working example — not yet as a gem, but as a set of classes. It consists of a concern that is added to each active-record model and acts as a publisher. This concern calls a broker class that holds a mapping of subscribers for each action (create, update, ...). The subscribers are mapped using a strategy pattern, and on invocation the broker iterates over the subscribers and calls them with all relevant data (model type, model id, state) in an async manner using a sidekiq worker. The worker is configured with unique-jobs so that multiple jobs of the same kind cannot run twice (this handles the race condition when multiple servers change the same model).
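A stripped-down sketch of that broker, with the Sidekiq/unique-jobs dispatch replaced by an inline call and all names invented for illustration:

```ruby
# Minimal in-process version of the broker described above. In the real
# implementation each subscriber call is enqueued as a unique Sidekiq job
# instead of being invoked inline.
class EventBroker
  SUBSCRIBERS = Hash.new { |h, k| h[k] = [] }

  # Map a subscriber (any object responding to #call) to an action.
  def self.subscribe(action, subscriber)
    SUBSCRIBERS[action] << subscriber
  end

  # Called by the publishing concern after an action (create, update, ...);
  # fans the event out to every subscriber registered for that action.
  def self.publish(action, model_type, model_id, state)
    SUBSCRIBERS[action].each do |subscriber|
      subscriber.call(model_type, model_id, state)
    end
  end
end
```

Subscribers being plain callables is what makes the strategy pattern work here: each strategy is just another object mapped under the action.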

I hope that I'll have the time in the near future to update the gem with the implementation described above and release it.

That's all,


Monday, December 21, 2015

Tidbits - Performance issue, newrelic to the rescue


I just got my New Relic weekly digest, and after a quick look I started to smile: the app's page views had just gone up by approximately 1000%. That is huge! But then I thought about it a little more, and questions came up. Was I expecting such an increase? Is it possible that such an increase happened and was missed altogether? Besides that, I saw that the page load times were really good — far too good — so I started to wonder about the specific page being loaded. Since New Relic issued the report, I started to deep dive into it. After looking around for some time I went to the browser section, into the page views and onto transactions. There, after a quick glance, I found the specific URL that caused the massive increase. From there it was easy to see that the increase was code related and user related, so the road to a quick fix was short.

Again newrelic to the rescue.



Thursday, December 3, 2015

Tidbits - Publishing events with some ruby magic


I'm currently working on the simple-events gem, which will help make things a little more organized in pub-sub scenarios within the same ruby process. The classic in-app event-driven way.
You can read more on the motivation here.

While I'm still in the middle of the work and still pondering the best way(s) to allow publishing and subscribing, I wrote a small concept code that allows easy publication of an event when a specific method is called.

Look at this code:

module Trackable
  def self.included(base)
    base.extend(ClassMethods) # missing in the original snippet; needed for trackable_options
    base.class_attribute :trackable_options_hash # class_attribute comes from ActiveSupport
  end

  def fire_event
    puts "calling trackable"
    puts "event payload #{payload}" if self.respond_to?(:payload)
  end

  module ClassMethods

    def trackable_options(opts = {})
      self.trackable_options_hash = opts
      register_future
    end

    def get_trackable_options
      trackable_options_hash
    end

    def register_future
      method = get_trackable_options[:key].to_s
      method_sym = method.to_sym
      aliased_method_sym = "#{method}_orig".to_sym
      alias_method aliased_method_sym, method_sym
      define_method(method_sym) do |*args|
        self.send(aliased_method_sym, *args)
        fire_event # fire the event right after the original method runs
      end
    end
  end
end

class Publisher
  include Trackable

  def foo
    puts "in foo"
  end

  def payload
    { id: '12345' }
  end

  trackable_options key: 'foo'
end

So the magic is in the Trackable module. In the Publisher class I declare that after the foo method is called, I want the fire_event method to be called with the payload. The code in the module aliases the method, then defines a method on the fly that calls the original method and then calls fire_event. One more important thing: due to the order of declarations, I have to use trackable_options, which hooks the event, after the foo method declaration.
Now this is a very useful and reusable way of making a method fire an event. When the code is ready it will allow defining the event name as an option, the use of payload and more. It may also include an auto_loader that will allow us to put trackable_options right after the include declaration and make the code even cleaner.