Rambling's of a {Social} Tech Guy! http://amnigos.com
Startups, Products, Marketing, Cloud and Cappuccino
posterous.com

Fri, 17 Feb 2012 10:11:00 -0800 How To Guide : Tata Docomo 3G on Ubuntu 11.10 http://amnigos.com/how-to-guide-tata-docomo-3g-on-ubuntu-1110

It's pretty straightforward; just follow the steps below.

1. Connect your 3G stick and boot up Ubuntu.

2. Open Network Connections and click on New Mobile Broadband Connection.

3. Click Continue in the dialog box, select India as the country, and then click Continue.

4. It will now show a list of service providers. DON'T select "Tata Docomo", as that entry is for Photon+ and not for 3G. Instead, select the "I don't know my provider" option, enter "TATA DOCOMO UMTS", and click Continue.

5. In the billing dialog, select the "My plan is not listed" option and enter "tatadocomo3g" as the APN, then click Confirm and save your settings.

6. Under Network Connections you should now see a "TATA DOCOMO UMTS connection" option; click on it. If it doesn't connect to the internet, just unplug and re-plug your 3G stick.

That's it, you can now use your Tata Docomo 3G stick with Ubuntu 11.10 :)

 

Permalink | Leave a comment  »

]]>
http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati
Mon, 13 Feb 2012 07:50:00 -0800 Big Data and Hadoop in Cloud - Leveraging Amazon EMR http://amnigos.com/big-data-and-hadoop-in-cloud-leveraging-amazo

I gave a talk last week at Barcamp Bangalore on "Big Data and Hadoop in Cloud - Leveraging Amazon EMR". The focus was to help the audience understand Big Data and how to leverage frameworks like Hadoop to build context and derive insights. Big data is becoming a common use case, so we need distributed systems that can store growing data sets and take advantage of parallel processing to analyze them.

I spoke about Hadoop and Map Reduce in general, and about how to run Hadoop Map Reduce jobs using the Amazon EMR service. I also shared some insights from managing hyper scale production Hadoop clusters and tuning them for performance. Think 68400 GB RAM, 26000 CPUs and 1700000 GB of disk :)

Drop me a note if you have any specific comments. Would love to hear your feedback!

Thu, 09 Feb 2012 22:50:00 -0800 How To Guide - Hadoop MapReduce Debugging in Local Setup http://amnigos.com/how-to-guide-hadoop-mapreduce-debugging-in-lo

One of the important steps after you have installed Hadoop successfully and played with the samples is to configure your development environment and figure out how to debug your Map Reduce programs written in Java.

Hadoop can be installed in the local environment in 3 different modes :

1. Local Mode

2. Pseudo Distributed Mode

3. Fully Distributed Mode (Cluster)

Typically you will run your local Hadoop setup in Pseudo Distributed Mode to leverage HDFS and Map Reduce (MR). However, you cannot easily debug MR programs in this mode because each Map/Reduce task runs in a separate JVM process, so you need to switch back to Local Mode, where your MR programs run in a single JVM process.

Configure Hadoop for Debugging :

  • Run Hadoop in local mode for debugging so that mapper and reducer tasks run in a single JVM instead of separate JVMs. The steps below help you do it.
  • Configure HADOOP_OPTS to enable debugging, so that when you run your Hadoop job it waits for the debugger to connect: export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"
  • Change the fs.default.name value in core-site.xml from hdfs:// to file:///. You won't be using HDFS in local mode.
  • Set the mapred.job.tracker value in mapred-site.xml to local. This instructs Hadoop to run MR tasks in a single JVM.
  • Create a remote debug configuration in Eclipse and set the port to 8008 - typical stuff.
  • Run your Hadoop job (it will wait for the debugger to connect) and then launch the Eclipse debug configuration.
  • Also use your favorite profiler to understand code-level hotspots.
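The configuration changes in the list above can be sketched as a small script. This is a minimal sketch rather than part of the original setup: it uses only Python's standard library to render the `<configuration>` XML layout that Hadoop site files use, and the helper name `render_site_xml` is hypothetical. The property names, values and JDWP options are the ones from the steps above.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string.

    Hypothetical helper for illustration; it only builds text and does not
    touch any Hadoop installation.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Local (single JVM) mode settings taken from the steps above.
core_site = render_site_xml({"fs.default.name": "file:///"})
mapred_site = render_site_xml({"mapred.job.tracker": "local"})

# JDWP options from the steps above: suspend=y makes the JVM wait
# until a debugger attaches on port 8008.
hadoop_opts = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"

print(core_site)
print(mapred_site)
print('export HADOOP_OPTS="%s"' % hadoop_opts)
```

Remember to switch both values back (hdfs:// and your job tracker address) when you return to Pseudo Distributed Mode.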

 

How do you debug your MR programs?

Fri, 03 Feb 2012 23:33:00 -0800 Amazon DynamoDB : Yet Another NoSQL but Powerful in Cloud http://amnigos.com/amazon-dynamodb-yet-another-nosql-but-powerfu

I have been using NoSQL databases in the Amazon cloud, and one of the issues you will run into is variable IO as your datastore grows exponentially. While I am really kicked about DynamoDB as a fully managed NoSQL service, what makes it stand apart from other offerings, or from running your own NoSQL cluster, is performance. Yes, having SSDs for storage is a real killer feature for controlling disk IO, and automatic partitioning gives predictable performance for your read/write queries. The free tier also makes it easy to explore and benchmark with your own data. Have you tried it yet?

 

P.S : AWS is having a free webinar on 15th Feb on DynamoDB - do register.

Wed, 04 Jan 2012 01:41:00 -0800 Amazon S3 Object Expiration : Leveraging it for Cloud Backups http://amnigos.com/amazon-s3-object-expiration-leveraging-it-for

One of the issues with storing large amounts of backup data in Amazon S3 is having to write custom scripts to delete the data after a certain age. Managing and maintaining these pruning tasks becomes painful when you have large-scale heterogeneous data for hundreds of applications.

The Object Expiration feature in Amazon S3 is very useful for this task and makes life easier for developers and users. You can learn more about it here.

As mentioned on AWS website :

You can define Object Expiration rules for a set of objects in your bucket. Each expiration rule allows you to specify a prefix and an expiration period in days. The prefix field (e.g. “logs/”) identifies the object(s) subject to the expiration rule, and the expiration period specifies the number of days from creation date (i.e. age) after which object(s) should be removed. You may create multiple expiration rules for different prefixes. After an Object Expiration rule is added, the rule is applied to objects with the matching prefix that already exist in the bucket as well as new objects added to the bucket. Once the objects are past their expiration date, they will be queued for deletion. You will not be charged for storage for objects on or after their expiration date. Amazon S3 doesn’t charge you for using Object Expiration. You can use Object Expiration rules on objects stored in both Standard and Reduced Redundancy storage. Using Object Expiration rules to schedule periodic removal of objects eliminates the need to build processes to identify objects for deletion and submit delete requests to Amazon S3.
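To make the prefix and age semantics quoted above concrete, here is a minimal sketch in plain Python. It does not call S3 at all: the rule list, the helper names and the sample keys are hypothetical, it applies only the first matching prefix, and it ignores any details of how S3 itself schedules deletions.

```python
from datetime import datetime, timedelta

# Hypothetical rules for illustration: (key prefix, expiration period in days).
RULES = [("logs/", 30), ("backups/db/", 90)]


def expiration_date(key, created, rules=RULES):
    """Return when an object expires under the first matching prefix rule, or None."""
    for prefix, days in rules:
        if key.startswith(prefix):
            return created + timedelta(days=days)
    return None


def is_expired(key, created, now, rules=RULES):
    """True if the object matches a rule and is on or past its expiration date."""
    expires = expiration_date(key, created, rules)
    return expires is not None and now >= expires


created = datetime(2012, 1, 1)
now = datetime(2012, 2, 15)
for key in ("logs/app.log", "backups/db/dump.sql", "images/logo.png"):
    print(key, is_expired(key, created, now))
```

With these sample rules only the `logs/` object is past its 30 day period on 15 Feb; the backup still has time left and the image matches no rule, so it is never expired.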

 

Wed, 14 Dec 2011 01:42:00 -0800 Performance Tuning : Why STRACE Is Your Best Friend? http://amnigos.com/performance-tuning-why-strace-is-your-best-fr

I have been working with many customers over the last 6 months on fixing performance issues across different stacks in production deployments. One of the tools that always comes to the rescue is strace, a diagnostic tool that traces the system calls made by a process. You can install it using your favorite package manager if it's not there already.

Most of the time, tuning performance on a live production machine (if you have some decent scale) is like dealing with a patient in an emergency ward - you have to act fast and fix things quickly. During any troubleshooting there will be many components in the system or application, and you need time to understand their impact before getting into "let's fix this" mode.

I generally start with the first or last component in the chain (typically a web/app or DB server) and try to identify what it is doing with strace. It has helped me many times to identify whether the issue was with the apache/nginx/php-fpm/uwsgi/java processes themselves, with a process blocking on a resource like the DB, or with bad application code making a certain system call too many times (imagine iterating over 1000's of records and accessing the timezone locale every time instead of caching it).

The simplest way to get diagnostic information is to attach strace to your busy process (high CPU or memory) and just watch the system calls - the execution trace of the process is immensely helpful.

Attach strace to a process for diagnostics : $ strace -p <pid>

To see only the open, read, close and connect calls of a process : $ strace -e trace=open,read,close,connect -p <pid>

Capture strace output for a process : $ strace -p <pid> -o /file/path/debug.php-fpm.txt

What is taking time? : $ strace -c -p <pid>

You can also use options like -s to raise the maximum printed string size above the default of 32 characters. It is the most powerful tool for troubleshooting things in Linux environments.
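If you attach strace from scripts often, the invocations above can be assembled programmatically. The following is a minimal sketch, assuming you only need the flags already shown in this post (-p, -e trace=, -o, -c and -s); the function name `strace_command` and the sample PID are hypothetical, and the sketch only builds the argument list without running anything.

```python
def strace_command(pid, syscalls=None, output=None, summary=False, string_size=None):
    """Build the argv list for attaching strace to a running process.

    Hypothetical helper for illustration. Note that the syscall names are
    joined with commas and no spaces, which is what -e trace= expects.
    """
    argv = ["strace"]
    if summary:
        argv.append("-c")  # per-syscall time and call-count summary
    if syscalls:
        argv += ["-e", "trace=" + ",".join(syscalls)]
    if string_size is not None:
        argv += ["-s", str(string_size)]  # max printed string size
    if output:
        argv += ["-o", output]  # write the trace to a file
    argv += ["-p", str(pid)]
    return argv


# 1234 is a placeholder PID.
print(" ".join(strace_command(1234, syscalls=["open", "read", "close", "connect"])))
print(" ".join(strace_command(1234, summary=True)))
print(" ".join(strace_command(1234, string_size=256, output="/tmp/trace.txt")))
```

The resulting list can be handed to `subprocess.run` on a machine where strace is installed and you have permission to trace the target process.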

Happy debugging and good luck with tuning systems - the best job at times:)

Thu, 08 Dec 2011 02:38:00 -0800 Startups - In the END, Only One Thing Matters For The World : SUCCESS! http://amnigos.com/for-startups-in-the-end-only-one-thing-matter

So Taggle decided to move on instead of fighting to be the "last man standing in the game" - you can read the whole story on Pluggd.in, along with Mahesh Murthy's open letter to Taggle on VC Circle and John Kuruvilla's response to it {for more masala}.

Here is my take and why I disagree with Mahesh Murthy :

With all due respect to Mahesh, what are the differentiators for PinStorm in the work they do? How does it differ from the thousands of other agencies out there in the world? As someone said, predicting the future (you can bullshit) and analysing the past (you can reason to death) are the easiest things to do, but creating something is damn f***ing hard. I was surprised to see Mahesh's stand on Air Deccan, saying "I told you so and it happened", because he generally preaches about "how predictions suck" during his talks at conferences :P

Failure vs Success in Startups :

For the world, ultimately only one thing matters in our business: whether or not you win. Nobody cares about how you slogged or dragged. If you succeed you will be glorified to immortality, but if you fail some people will thrash you to death {in some cases on how you suck and} on how your business sucked, with every possible reason. As an entrepreneur you will move on and do something that interests you, while blogs and media (if you are popular enough) write about how they predicted failure or doomsday. My best wishes to the Taggle team for trying something while a hundred others were counting how many deal sites were coming up. Was I part of that hundred? :)

I believe taking a dig is much easier than building something valuable and sustaining it {and also going ahead and recreating success in another area}. I tried startups two times and both of them failed terribly. As all of us know, nobody gives a damn about failures after a week - period.

So as an entrepreneur, have fun while doing whatever you are doing {even if people say you must be a moron to do such a stupid thing}, so that in the end at least you had fun, if not success {even though that's what matters for the world}.

[Image: Gandhi1]

 

Image Credit : Shamelessly copied from ripoffornot.org - go FreshDesk and do it cowboys :)

P.S : If you are building something and looking for smart guys then go hire engineers from the Taggle team.  I spoke to a few, they are really smart guys :)

Sat, 03 Dec 2011 21:57:00 -0800 Building Culture at Kuliza - My talk at HR4Startups Event http://amnigos.com/building-culture-at-kuliza-my-talk-at-hr4star

I was part of a panel discussion at the HR4Startups event at IIM-B on 3rd December, along with Amiya from ZipDial, Pallav from FusionCharts and Kumud from SuperSeva, on hiring, people management, challenges and what worked for us.

[Image: Lawsofhiring]

I did a presentation on our beliefs and culture at Kuliza around hiring and having fun while doing what we do, and on why I believe "Culture is what separates great companies from others. Not Technology".

 

If you are running a startup then what worked for you?

Tue, 29 Nov 2011 04:05:00 -0800 Netflix - Leveraging Public Cloud (AWS) http://amnigos.com/global-netflix-platform-leverage-public-cloud

Netflix is the poster boy of public cloud adoption for running large-scale systems, and their engineering blog has always shared most of their learnings and experience with the AWS cloud - from designing fault-tolerant systems, scaling SimpleDB to Cassandra. The presentation below from Adrian Cockcroft shares details of their global platform, including why they chose the AWS cloud :)

Sun, 27 Nov 2011 20:42:00 -0800 Cassandra in AWS Cloud :Summary from AWS User Group Bangalore Meetup - November http://amnigos.com/cassandra-in-aws-cloud-discussons-from-aws-us

I co-hosted the AWS Cloud User Group Bangalore meetup for November at Kuliza Technologies with Sreekandh & Vivek. The theme of this meetup was "Running Cassandra in Cloud" and it was attended by 15+ people interested in exploring NoSQL solutions like Cassandra.

We started the meetup with introductions and a tribe forming exercise (it was fun), then divided into 3 groups so each group could present some of the topics: introduction to NoSQL, hands-on Cassandra, schema design, CAP theorem, scalability and performance.

I did a quick hands-on session on running Cassandra on a single node and presented an overview of all the configuration parameters. All you need to run a Cassandra node is a Java 1.6 runtime, and you can download binaries for Windows (don't forget to set your JAVA_HOME variable). We discussed the need for using different disks for the commit log and the data directory, how to leverage the Row Cache or Key Cache in Cassandra to improve read performance for different use cases, and the various read/write consistency models available.
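The read/write consistency discussion can be illustrated with a little arithmetic. This is a minimal sketch of the general quorum rule for replicated stores, not something taken from the meetup slides: with a replication factor N, a read touching R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_overlap_writes(read_replicas, write_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print("quorum for RF=3:", q)
print("QUORUM reads + QUORUM writes overlap:", reads_overlap_writes(q, q, rf))
print("ONE read + ONE write overlap:", reads_overlap_writes(1, 1, rf))
print("ONE read + ALL write overlap:", reads_overlap_writes(1, rf, rf))
```

Lower consistency levels trade this overlap guarantee for lower latency, which is why the right choice depends on the use case.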

Cassandra also ships with a default command-line client (cassandra-cli) which can be used to connect to the server, create keyspaces and column families, and write and read data. You can use one of the high-level client bindings to work at the programming level.

If you are interested in launching a Cassandra cluster without doing too much work then you can explore Whirr - an open source cluster service that allows you to launch cloud-based clusters for Cassandra, Hadoop, HBase etc. in 10 mins.

It was a lot of fun, as you can see from the pics below :)

[Image: Awsmeetup1]

Interested in attending or hosting the next session? - join us at AWS User Group on Meetup.com.

Wed, 23 Nov 2011 03:20:00 -0800 Heroku Launches DB as a Service for PostgreSQL http://amnigos.com/heroku-launches-db-as-a-service-for-postgresq

Heroku has launched a DBaaS for PostgreSQL - this will be very useful given that PostgreSQL has a large community of users. You can sign up for it at http://postgres.heroku.com/

Heroku's pitch (as taken from their site):

  • A powerful, reliable, and durable open-source SQL-compliant database, PostgreSQL is the datastore of choice for serious applications. Now it is available in seconds with a single click. Never worry about servers. Never worry about config files. Never worry about patches. Simply focus on your data.
  • Databases are multi-ingress; use them from any cloud, PaaS, or your local computer. It is easy to connect from common languages & frameworks including Rails, Django, PHP, and Java: configuration strings are generated for them automatically. Need to test a schema migration or perform load testing? Fork your database to create an exact copy of your schema and data.
  • Scale vertically by choosing from a range of plans. Plans differ based on the size of their hot-data-set, the portion of data available and optimized on-the-fly in high speed RAM. When the time comes, scale horizontally by adding read-only followers that stay up-to-date with the master database.
  • Forget daily backups, Continuous Protection redundantly archives data to high-durability storage as it is written, ensuring that it is safe no matter what. Automated health-checks are performed every 30 seconds to ensure that databases are available and working. And if something goes wrong, there is an ops team on call 24/7.

Would you use a DBaaS?

 

Thu, 17 Nov 2011 07:23:00 -0800 Troubleshooting Django(Python) Application with Nginx + uWSGI on MySQL http://amnigos.com/scaling-djangopython-application-with-nginx-u

The good thing about working at Kuliza is that I get to experiment, play and work with different stacks for our customers. Recently we started working with one of our cloud customers, a fast-growing startup based out of Bangalore. They run their SaaS application in the AWS cloud with Python (Django) on nginx + uWSGI, using MySQL as the backend. They have two main components in the application: a frontend application that serves data from the DB, and backend workers that write to the DB. The application runs on 2 servers handling HTTP requests behind Amazon ELB, 1 MySQL DB, and 1 large server for MongoDB and memcache.

One of their critical issues was being unable to take backups from MySQL, as doing so froze the system and caused the application servers to fail. They were running MySQL on Amazon RDS in a single zone and, as is well known, it suspends reads/writes while you perform a snapshot or backup using the RDS API. After our initial review, we decided to move their DB to a Multi-AZ deployment so that snapshots and backups would be performed on the standby and would not cause any read/write freeze on the master. While we were able to take snapshots after the RDS Multi-AZ deployment, their write latency increased by 3x, which was unusually high, and read/write I/O had been consistently increasing for quite some time even before we moved to Multi-AZ.

[Image: Write_after_multiaz]

Along with this, their application was consistently getting into an error state as nginx was unable to connect to the upstream (uWSGI workers), and the impact was severe as almost 10% of their traffic could not be served because of this. They had just moved to nginx + uWSGI from an Apache + mod_uwsgi deployment (high CPU consumption and unable to scale due to heavy processes).

Initial focus :

  • Review nginx and uWSGI configurations - we found a few misconfigurations related to worker_processes, worker_connections, listen queue etc. and corrected them, but it didn't help. Each app server was running on a box with 5 CPUs and 1.7 GB of memory.
  • The problem was very severe, with multiple outages a day. Nginx was reporting thousands of "unable to connect to upstream" errors while the uWSGI logs didn't have any specific errors. The system health stats looked good even during outages - CPU was around 20% and there was enough free memory. There was no major disk I/O, swapping, interrupts or context switches.
  • A quick look at the RDS stats confirmed that Multi-AZ had not affected read performance, and that write latency had increased because we now have synchronous replication to a standby for quick failover.
  • This production setup runs uWSGI with a master process and a predefined number of workers (20), with each worker configured to handle max-requests of 10K before re-spawning. We reduced max-requests to 1000 to see if quicker recycling would help, but it aggravated the problem because of frequent re-spawning. So the value was increased to 2500, but there was no improvement. There was also a belief that with Apache they never had outage issues.
  • We decided to increase the number of app servers from 2 to 4 to see if that would fix the problem, but it caused an immediate outage so we had to roll back. Initially we suspected this could be because of a centralized resource like the DB or cache server, but we ruled out the DB as the RDS stats did not show any huge spikes in read/write latency or I/O during the outage.
  • The production outages became very frequent and response times increased to 20 to 30 seconds in a few cases, for 15 to 30 minutes at a time, forcing us to reload our nginx + uWSGI processes.

Deep dive to figure out bottlenecks:

  • At the uWSGI level, we enabled harakiri for timeouts and harakiri-verbose for logging the trace of system calls causing workers to hang. This helped us isolate the problem to the database, as uWSGI was hanging on read calls to the database. You can also debug uWSGI processes using the "strace" utility on Linux to identify bottlenecks and monitor system calls.
  • When we looked at the MySQL process list (you can see it using "show processlist"), there were a bunch of queries hanging for 20 to 30 seconds because of locks - this was due to contention between our write workers and read app servers. The underlying tables were using MyISAM as the storage engine, and its table-level locking made select queries wait on the frequent inserts/updates by the workers. We moved the storage engine to InnoDB and that resolved most of the performance issues, as uWSGI workers were no longer waiting on the DB thanks to row-level locking in InnoDB. The screenshots below show the improvements in read/write I/O.

  • We also found that the django_session table held almost 35GB of data - there was no automatic purging of old records. This was causing I/O bottlenecks and increased write latency on the master DB. We are moving away from using the DB as the backend for session data and switching to memcache.
  • Using the pmap utility, we found that each uWSGI worker was using around 80MB of memory, which limited how many concurrent requests a single server could handle. Most of this memory was process private/working set, which we later attributed to loading a separate copy of static data as a cache into each worker. The product team is now working on re-architecting the application so we can read this static data from memcache instead of loading it into each uWSGI worker. This will help us achieve higher concurrency by running more uWSGI workers.
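The memory ceiling described in the last bullet is easy to see with the numbers from this post (1.7 GB per app server, roughly 80MB per uWSGI worker, 20 workers configured). The sketch below is only back-of-the-envelope arithmetic: the function name is hypothetical and the 256MB reserve for the OS and nginx is an assumed figure, not a measurement.

```python
def max_workers(total_mb, per_worker_mb, reserved_mb=0):
    """How many workers of a given size fit in memory after a reserve is set aside."""
    usable = total_mb - reserved_mb
    return max(int(usable // per_worker_mb), 0)


total_mb = 1.7 * 1024   # 1.7 GB per app server, from the post
per_worker_mb = 80      # measured with pmap, from the post
configured_workers = 20

print("memory used by configured workers (MB):", configured_workers * per_worker_mb)
print("workers that fit with no reserve:", max_workers(total_mb, per_worker_mb))
# reserved_mb=256 is an assumed allowance for the OS, nginx, etc.
print("workers that fit with a 256MB reserve:", max_workers(total_mb, per_worker_mb, reserved_mb=256))
```

With 20 workers already taking about 1600MB, there is essentially no headroom on the box, which is why shrinking the per-worker footprint matters more than adding workers.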

    Useful MySQL Queries :

    Find size of your database -
    SELECT table_schema "Data Base Name", SUM(data_length + index_length) / 1024 / 1024 "Data Base Size in MB"
    FROM information_schema.TABLES GROUP BY table_schema;

    Find size of your tables -
    SELECT TABLE_NAME, table_rows, data_length, index_length, ROUND(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
    FROM information_schema.TABLES WHERE table_schema = "Your_Schema_Name";
         

    In our case, the bottlenecks were clearly at the MySQL level, due to locking issues and heavy writes to the session table. The application has been running smoothly for the last 2 weeks after the above changes. One of the important lessons learned was to always troubleshoot your centralized resource first when you run into performance or scalability issues.
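The two MySQL fixes described in this post (switching MyISAM tables to InnoDB and purging old sessions) can be sketched as generated SQL. This is a minimal sketch, not the exact statements we ran: the helper name and the sample table names are hypothetical, and the script only builds strings without connecting to a database. Bear in mind that changing the engine rebuilds the table, so on a large table it can take a long time.

```python
# Query to list tables still on MyISAM in one schema (parameter supplied by your DB driver).
FIND_MYISAM_TABLES = (
    "SELECT table_name FROM information_schema.TABLES "
    "WHERE table_schema = %s AND engine = 'MyISAM';"
)

# Django's session table stores an expire_date per row, so stale rows can be purged.
PURGE_EXPIRED_SESSIONS = "DELETE FROM django_session WHERE expire_date < NOW();"


def innodb_migration_statements(tables):
    """Build one ALTER TABLE statement per table to switch its engine to InnoDB."""
    return ["ALTER TABLE `%s` ENGINE=InnoDB;" % name for name in tables]


# Hypothetical table names for illustration.
for statement in innodb_migration_statements(["orders", "events"]):
    print(statement)
print(PURGE_EXPIRED_SESSIONS)
```

Review the generated statements and run them during a quiet period, since both the rebuild and a large delete add load to the master.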

    Wed, 16 Nov 2011 05:20:00 -0800 Amazon ELB New Features - HTTP Response Code Metrics and Improved DNS http://amnigos.com/amazon-elb-new-features-http-response-code-me

    AWS has added two important new features to Elastic Load Balancer: metrics that help developers monitor their application's HTTP response codes, and improved DNS lookup.

    HTTP response code monitoring : Amazon ELB will now report the count of all HTTP responses across all attached instances. This is very helpful for monitoring application response behavior at the load balancer level to identify 5xx failures and 4xx issues. For details on ELB metrics, visit this documentation page. I have attached a screenshot from one of our production ELB instances (sampled at 5 min intervals) where we had 5xx issues due to bottlenecks at the DB level.

    [Image: Elb_metrics_for_2xx_vs_5xx]

    Improved DNS resolution of ELB lookup : A DNS lookup of the ELB name will now return an entry with up to 8 IP addresses, depending on your load balanced zones and instances. This helps clients avoid additional lookups if they fail to connect to one of the load balancer IP addresses.

    The HTTP response code metrics will be of great help in alerting developers to 5xx issues, so they can look at the logs to find out what went wrong without having to watch the logs continuously :)

    [Image: Http_status_codes_vs_request_count]
    Go ahead and create your CloudWatch alarms for 5xx errors.
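The kind of check such an alarm performs can be sketched locally. This is a minimal sketch that does not call CloudWatch or ELB: it groups a list of status codes into classes and flags when the share of 5xx responses crosses a threshold. The function names, the 5% threshold and the sample data are all assumptions for illustration.

```python
from collections import Counter


def status_class_counts(status_codes):
    """Count responses per class, e.g. {'2xx': 90, '4xx': 4, '5xx': 6}."""
    return Counter("%dxx" % (code // 100) for code in status_codes)


def error_rate_5xx(status_codes):
    """Fraction of responses that are server errors (0.0 for an empty sample)."""
    counts = status_class_counts(status_codes)
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0


def should_alarm(status_codes, threshold=0.05):
    """True when the 5xx share exceeds the (assumed) threshold."""
    return error_rate_5xx(status_codes) > threshold


# Made-up sample for one 5 minute window.
sample = [200] * 90 + [404] * 4 + [500] * 3 + [503] * 3
print(dict(status_class_counts(sample)))
print("5xx rate:", error_rate_5xx(sample))
print("alarm:", should_alarm(sample))
```

In a real setup the counts would come from the per-class ELB metrics rather than from raw status codes, and the threshold should reflect your normal error baseline.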

    Mon, 14 Nov 2011 05:07:00 -0800 Weblogic to Tomcat Migration - Problems and Learnings http://amnigos.com/weblogic-to-tomcat-migration-problems-and-lea

    We migrated a Java/Spring application from Weblogic to Tomcat. The reason for migrating away from Weblogic was purely a business one: the customer didn't want to pay any licensing costs. The migration involved some simple code and configuration changes. However, there were serious performance issues when running the application in a load balanced Apache Tomcat environment, while the same code base worked on Weblogic without any issues.

    We work with this customer only on their cloud management, and product development is managed by another company, but our cloud team was actively involved in identifying the root cause of the performance hit in the Tomcat environment. The initial focus was to review the Tomcat and load balancer configurations and collect data points from the web server, app server and database server to isolate the component causing bottlenecks. This deployment had a load balanced web server running Apache + Mod_JK in front of 4 Tomcat servers on different boxes, and one DB server. Authentication/authorization was managed by a SAML SSO service.

    We installed the Applications Manager product from ManageEngine in our Dev environment to monitor Apache, Tomcat and Oracle processes. We generally use Yourkit for monitoring, but since this was a load balanced environment and we needed to capture metrics across different components in the stack, we opted for ManageEngine - it's a cool product from ZOHO :)

    Key findings from our monitoring and review:

    1. Web server performance was good and we didn't find any issues at the load balancer.

    2. Huge spikes in CPU/memory usage on Tomcat during particular use cases in the application, which did not happen on Weblogic for the same codebase.

    3. High spikes in network in/out at the DB server for the same SQL queries on Tomcat compared to Weblogic.

    4. Ajax calls to the server side application were slower on Tomcat compared to Weblogic.

    5. The major configuration difference between Weblogic and Tomcat was the connection pooling provider.

    With the above data points we were able to isolate a significant difference between Tomcat and Weblogic in application-to-database communication. Digging further, we found that Weblogic has good connection pool management and uses a prepared statement cache, which was helping application performance, while Tomcat was using the DBCP connection pool.

    [Image: Weblogic_vs_tomcat_on_db_for_same_query]

    So we experimented with upgrading DBCP to the latest version and tried tuning different parameters, including prepared statement caching, but performance still degraded quickly and DBCP was slower than C3P0 in our comparison. After deploying C3P0 on Tomcat, application performance improved significantly, and prepared statement caching gave an experience similar to the Weblogic deployment, even though there was a slight degradation after continuous usage (as we were limiting prepared statement caching to a few hundred statements).

    The key things to understand while configuring connection pools in application servers are the initial pool size required on startup, the maximum number of idle connections and prepared statement caching. We also tuned minThreads in Tomcat on startup to increase parallelism and improve the performance of ajax calls.
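To illustrate why a size-limited prepared statement cache can degrade slightly under continuous usage, here is a minimal sketch of a least-recently-used cache in Python. It is not how Weblogic, DBCP or C3P0 implement their caches; the class name, the `prepare` callback and the tiny cache size are hypothetical and chosen only to show evictions turning into repeated prepare work.

```python
from collections import OrderedDict


class StatementCache:
    """Tiny LRU cache keyed by SQL text (illustrative only)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, sql, prepare):
        """Return the cached statement, preparing (and caching) it on a miss."""
        if sql in self._entries:
            self._entries.move_to_end(sql)  # mark as most recently used
            self.hits += 1
            return self._entries[sql]
        self.misses += 1
        statement = prepare(sql)
        self._entries[sql] = statement
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used
        return statement


def fake_prepare(sql):
    """Stand-in for the expensive round trip that prepares a statement."""
    return ("prepared", sql)


cache = StatementCache(max_size=2)
for sql in ["SELECT a", "SELECT b", "SELECT a", "SELECT c", "SELECT b"]:
    cache.get(sql, fake_prepare)
print("hits:", cache.hits, "misses:", cache.misses)
```

When the application cycles through more distinct statements than the cache can hold, evicted statements have to be prepared again, so sizing the cache against the real number of distinct queries is worth measuring.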

    This enterprise application is now in production supporting a few hundred users, and the performance has been good so far :)

     

     

     

     

     

    Sun, 06 Nov 2011 06:37:00 -0800 MongoDB Rants and The NoSQL Love - Hate Crisis http://amnigos.com/mongodb-love-and-hate-crisis

    MongoDB is one of the popular NoSQL players and has attracted many big customers, from FourSquare and CraigsList to hundreds of other large and small startups.

    The recent MongoDB threads (rants) on HackerNews, "Don't Use MongoDB" and "Failing with MongoDB", have generated quite a bit of discussion, and also resulted in posts supporting MongoDB and inviting serious discussion around the NoSQL movement.

    One of the conclusions from the serious rant was not about MongoDB bugs but about their code release culture, with specific comments like -

    The real problem is that so many of these problems existed
    in the first place.
    
    Database developers must be held to a higher standard than
    your average developer.  Namely, your priority list should
    typically be something like:
    
     1. Don't lose data, be very deterministic with data
     2. Employ practices to stay available
     3. Multi-node scalability
     4. Minimize latency at 99% and 95%
     5. Raw req/s per resource
    
    10gen's order seems to be, #5, then everything else in some
    order.  #1 ain't in the top 3.
    
    These failings, and the implied priorities of the company,
    indicate a basic cultural problem, irrespective of whatever
    problems exist in any single release:  a lack of the requisite
    discipline to design database systems businesses should bet on.

    Eliot, CTO of 10gen (MongoDB), has responded to the concerns with specific comments and the reasoning behind some of the issues.

    My take is that we are still in a very early market for NoSQL products, and the concerns are valid given the large-scale production deployments driven by the Big Data shift: any data loss or serious issue at the DB level is the worst thing for a company. Even Foursquare's famous outage was attributed to a MongoDB sharding and replication issue. Also, the adoption of NoSQL could be attributed to developers who perceive it as a painkiller for their huge data needs without having a complete understanding of their data, future growth, ACID requirements and usage patterns.

    Hopefully these debates and concerns will help to further improve NoSQL products and create the required awareness, as we need either NoSQL or SQL products to handle the Big Data shift (which has already taken place for most businesses).

    ]]>
    http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati
    Fri, 04 Nov 2011 05:09:00 -0700 AWS Cloud Computing India Tour 2011 http://amnigos.com/aws-cloud-computing-india-tour-2011 http://amnigos.com/aws-cloud-computing-india-tour-2011

    Amazon Web Services is doing a cloud tour in India across major cities: Bengaluru, Mumbai, Chennai and Delhi. These city-level events would be a good place for you to connect and understand how to leverage the cloud for your business.

    Also, Amazon CTO Dr. Werner Vogels will be giving a keynote address and moderating a panel discussion with their Indian customers who have leveraged AWS for their business. I will be speaking at the Bengaluru event on 17th November. Register for Bengaluru event here.

    As listed on the AWS site, below are some of the reasons to attend these events:

    • Discover how the AWS Cloud helps businesses drive cost efficiency and accelerate time to market to support their business growth.
    • Hear how AWS customers including Classle Knowledge, Consim, Dialify, Tatasky, Hungama, Kuliza, Myntra, NDTV, Reasoning, redBus, Sapient, UTV Interactive, Ventuno Technologies and Zenga Media have successfully built and migrated a variety of applications to the AWS Cloud.
    • Learn techniques and best practices on how to design fault tolerant applications on the AWS Cloud (Technical Track).
    • Hear how Amazon's own back-end technology infrastructure has successfully built and migrated a variety of applications to the Cloud.
    • Network with the AWS Asia Pacific team, customers and partners.

    Also YourStory.in is doing a Cloud Conclave at IIM-B on 19th November.

    ]]>
    http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati
    Thu, 03 Nov 2011 00:18:42 -0700 Troubleshooting Oracle database - AWR, reports StatsPack and VMSTAT http://amnigos.com/troubleshooting-oracle-database-awr-reports-s http://amnigos.com/troubleshooting-oracle-database-awr-reports-s

    In the "Sad story of Oracle 10g minor versions" blog post, I outlined some of the issues that we faced with a new Oracle database instance in the AWS cloud. While troubleshooting those problems, we learned a few quick tips that help to identify specific bottlenecks in an Oracle database instance.

    Automatic Workload Repository (AWR) report : This is one of the cool performance metrics gathering tools in Oracle 10g, which can be used to generate time-elapsed data from a running database instance. It is very helpful for gathering statistics from a production Oracle instance to understand the bottlenecks and tune specific settings. It is a good choice if you don't have access to Oracle Enterprise Manager.

    You can find the specific information on AWR statistics in the Oracle documentation; I have highlighted the important statistics below.

    • Wait events used to identify performance problems.
    • Time model statistics indicating the amount of DB time associated with a process from the V$SESS_TIME_MODEL and V$SYS_TIME_MODEL views.
    • Active Session History (ASH) statistics from the V$ACTIVE_SESSION_HISTORY view.
    • Some system and session statistics from the V$SYSSTAT and V$SESSTAT views.
    • Object access and usage statistics.
    • Resource intensive SQL statements - long running queries.

    The AWR repository is a source of information for several other Oracle 10g features including:

    • Automatic Database Diagnostic Monitor
    • SQL Tuning Advisor
    • Undo Advisor
    • Segment Advisor

    You can generate AWR reports in either HTML or text format using the awrsqrpi.sql script as outlined in this page. The HTML reports show all metrics, including the actual wait events, long running SQL queries, full table scans etc.

    Statspack Analyzer Tool : You can use this online Statspack analyzer tool by generating an AWR report in text format. Just copy and paste the AWR report contents and it will show you recommendations on whether you need to tune particular parameters like DB cache size, sort area size, pool size etc.

    OS Watcher : While we didn't use OSW actively, this tool helps in collecting system and network metrics of an Oracle instance. It has a bunch of scripts which gather OS stats using vmstat, netstat, iostat etc. on all supported OS platforms. You can download it from the Oracle support site.

    In our case, to identify CPU bottlenecks we used the vmstat utility and monitored the run queue size. At any point, the r value in the vmstat output should be less than the actual number of CPUs available in your machine to be sure that you don't have CPU contention issues.
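
    A rough sketch of that run queue check in Python, in case you want to script it. This is not what we ran in production; the function names and the sample vmstat output are made up for illustration, and it assumes the usual Linux vmstat layout where a header row names the r and b columns.

```python
import os


def run_queue_samples(vmstat_output):
    """Return the r (run queue) value from each data row of vmstat output."""
    samples = []
    r_index = None
    for line in vmstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "r" in fields and "b" in fields:
            # Column header row, e.g. " r  b   swpd   free ..."
            r_index = fields.index("r")
            continue
        if r_index is not None and fields[r_index].isdigit():
            samples.append(int(fields[r_index]))
    return samples


def cpu_contended(vmstat_output, cpu_count):
    """True if any sample shows a run queue at or above the CPU count."""
    return any(r >= cpu_count for r in run_queue_samples(vmstat_output))


# Hypothetical capture of `vmstat 5 2`; the numbers are invented.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 812345  10240 204800    0    0     5    12  110  220  8  2 89  1
 9  1      0 801234  10240 204800    0    0     7    40  350  900 70 20  5  5
"""

if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print("run queue samples:", run_queue_samples(SAMPLE))
    print("contended on this machine:", cpu_contended(SAMPLE, cpus))
```

    On a real box you would feed it the captured output of vmstat instead of the SAMPLE string, and look at several samples over time rather than a single reading.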

    ]]>
    http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati
    Sat, 29 Oct 2011 06:42:00 -0700 It's not Cloud, STUPID! http://amnigos.com/its-not-cloud-stupid http://amnigos.com/its-not-cloud-stupid

    So Mixpanel has decided to move out of the cloud and has listed some reasons which pushed them to make that call; their {fancy titled} post has evoked some strong reactions, as you can see from the comments and HN discussions. I also saw a post from an Indian startup (yeah, some good friends at Latlong, hello Sud?) justifying why they moved out of the cloud to dedicated hardware.

    While I have no issues with people making informed choices based on their own data (or pain) points, i.e. eat your own dog food, what really sucks is drawing conclusions about the cloud, calling it overkill or saying it's not for running production, and offering justifications on why it won't work. I would like these critical discussions to help people make informed choices rather than scare them away :) so here is my personal take on the "To Cloud or not to Cloud" discussions.

    1. Cloud is costly while I thought it will be cheap - If you think the cloud is cheaper than dedicated hosting for all your use cases, then you need to brush up on your maths or you have no clue of the real costs. Don't jump on the bandwagon because of a deal or because someone said cloud is cool and then complain that cloud is costly, because it is, and most likely will be in the near future, unless you have specific use cases where it can help you save costs.

    Let's say you are running a cricket application (anything with spikes or unpredictable traffic patterns) which gets hundreds of requests/second during live matches and barely a few requests otherwise. You need on-demand hardware that can scale easily without worry, so you need the cloud and use it exactly for that purpose. Can I ask a dedicated hosting provider to tune my machines as load varies?

    We run a bunch of enterprise apps and need to convert data from one format into different formats, so we need machines once in a while to take care of this automatically. I can't find a better option than spot or on-demand instances in the cloud, where we can just fire up a machine, run our task and release it. Also, we have a customer who wants to measure their production site's page load times every day, and the measurements need to run through real browsers (yeah, I know about New Relic, let's keep it for the next discussion) - we spin up a new machine for one hour, run the required scripts, collect data points, send an email to the group and shut that machine down. I don't know if I can easily bargain a $5/month deal with a dedicated hosting provider to do this and still have good flexibility.

    My point is that you need to understand your application, business use cases, business continuity (the lame term for keeping systems/data always available) and production scenarios before you can say "hey, cloud is costly" or "cloud works for me". And instead of generalizing the pay-per-use model of the cloud as costly, let's get our basics right.
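
    To make the "brush up your maths" point concrete, here is a small Python sketch of the comparison for a job like the one-hour-a-day browser test. The prices are hypothetical placeholders I picked for illustration (not real AWS or hosting quotes), so plug in your own numbers.

```python
def monthly_cost_on_demand(hours_per_day, hourly_rate, days=30):
    """Pay-per-use cost for a machine that runs only a few hours each day."""
    return hours_per_day * hourly_rate * days


def cheaper_option(hours_per_day, hourly_rate, dedicated_monthly, days=30):
    """Return the cheaper option and its monthly cost as (name, cost)."""
    on_demand = monthly_cost_on_demand(hours_per_day, hourly_rate, days)
    if on_demand < dedicated_monthly:
        return ("on-demand", on_demand)
    return ("dedicated", dedicated_monthly)


def break_even_hours_per_day(hourly_rate, dedicated_monthly, days=30):
    """Daily usage above which the dedicated box becomes the cheaper choice."""
    return dedicated_monthly / (hourly_rate * days)


# Hypothetical prices, for illustration only.
HOURLY_RATE = 0.16          # assumed on-demand price in $/hour
DEDICATED_MONTHLY = 100.0   # assumed dedicated server price in $/month

if __name__ == "__main__":
    name, cost = cheaper_option(1, HOURLY_RATE, DEDICATED_MONTHLY)
    print(f"1 hour/day: {name} at about ${cost:.2f}/month")
    name, cost = cheaper_option(24, HOURLY_RATE, DEDICATED_MONTHLY)
    print(f"24 hours/day: {name} at about ${cost:.2f}/month")
    hours = break_even_hours_per_day(HOURLY_RATE, DEDICATED_MONTHLY)
    print(f"break-even at about {hours:.1f} hours/day")
```

    With these made-up prices the occasional job is far cheaper on pay-per-use, while a machine that runs around the clock is cheaper on dedicated hosting, which is exactly why you need to know your own usage pattern first.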


    2. OMG! Cloud Systems Fail and Amazon had Outages - I have limited exposure (just 5+ years) to systems management, so I am not a real expert in data center management, but I have used and managed tens of physical servers from providers like GoDaddy, ThePlanet and SoftLayer, and have even run "hey, our own" mini data center using age-old VMware products on a bunch of physical machines. The real issue is that things fail everywhere - power might go down, the LAN at a GoDaddy data center might not work, SoftLayer can complain about some unknown random hardware issue, disks will fail and processors will degrade for no known reason. Ask anyone who has worked in a real data center - they will tell you all those stories {of wake-up calls to connect a cable unplugged for no god damn reason :)} which most of us might think are pure BS or made up. It is possible that virtual machines crash more often than physical machines, so you might lose machines in the cloud, but hey, you can also bring new systems up in 10 minutes - good luck finding a ticket-based hosting provider who can do it at the same speed.

    So the bottom line is that data centers can fail, power grids will go down (even in the U.S.), hurricanes might threaten data center availability and human beings might screw things up during upgrades at the network, system config or data center level. If you are a real systems guy and worried about failures at the systems level, then you should follow the "chaos monkey" pattern, thanks to Netflix for some real-world use cases - terminate your production systems, components or applications randomly to see if you can sustain failures - it's not about being crazy but about being paranoid to survive.
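
    The idea behind the chaos monkey pattern can be shown with a toy simulation in Python. To be clear, this is not Netflix's tool and it doesn't touch any real cloud API; the Fleet class, the instance names and the min_healthy threshold are all hypothetical, just enough to show "kill something at random, then check you are still serving".

```python
import random


class Fleet:
    """A toy fleet: named instances that are either running or terminated."""

    def __init__(self, names, min_healthy):
        self.running = set(names)
        self.min_healthy = min_healthy  # instances needed to keep serving

    def terminate(self, name):
        self.running.discard(name)

    def is_serving(self):
        return len(self.running) >= self.min_healthy


def chaos_round(fleet, rng):
    """Terminate one running instance at random and return its name.

    Returns None when nothing is left to terminate.
    """
    if not fleet.running:
        return None
    victim = rng.choice(sorted(fleet.running))
    fleet.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    fleet = Fleet(["web-1", "web-2", "web-3"], min_healthy=2)
    victim = chaos_round(fleet, rng)
    print(f"terminated {victim}; still serving: {fleet.is_serving()}")
```

    In a real setup the terminate step would call your provider's API and is_serving would be an actual health check against the application; the point is that the check has to pass after the random kill, not before it.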

    You need to architect for reliability, availability and fail-over :

    Either you architect a multi-region system, or you have a stand-by in Europe for your primary data center in US WEST with third-level backups sitting in Singapore, squeeze long-distance network transfer speeds using Tsunami UDP to keep data in sync between continents, use a robust DNS service for quick mappings and have DB-level replication from slaves. If you run a personal blog {& whatever application} and don't care even if it goes down for days, then you don't need to worry about anything but the cost of hosting it. So the availability, reliability and fault tolerance of your application is not a hardware or network problem. Let's stop blaming Amazon or the cloud provider for it and figure out our deployment design and architecture - isn't it a {systems} engineering problem? You need to design for failures even at the level of an entire cloud because you need it. I have written about it in the AWS context earlier.

    During Hurricane Irene, we did a DR for hundreds of our servers in US-EAST even though we were running multi-AZ, because what is important for the business is to recover systems and data even if a whole cloud region goes down. At the end of the day I am happy, because some of these outages expose the best practices required to run and manage reliable systems in the cloud (of course I would be happier if there were no outage in the first place), and I no longer need to convince my business or customers why we need to design fault-tolerant systems (which obviously cost more from a $$$ standpoint) and need extra resources.

    If you are looking for a reliable, scalable and fault-tolerant design, then the question is not dedicated hosting versus cloud but, as I said earlier - KNOW WHAT YOU NEED FOR YOUR APPLICATION IN PRODUCTION.

    3. These shared services on CLOUD SUCK and Performance is variable - While Mixpanel discussed certain things related to virtual disks and I/O issues faced at Rackspace, I have also noticed some issues with Amazon {we have 500+ servers with probably thousands of EBS drives}. Variable performance is a fact, but nothing like 10x degradation - it mostly varies by 10 to 30% at a few times. Yes, you won't get the same I/O that you get on dedicated physical hardware, but you can certainly fix I/O bottlenecks with other approaches. The question, as Mixpanel said, is what your long-term approach is and whether you have enough time to tune shared cloud services to yield good performance for your use case.

    I love S3 on AWS and yes, it's a shared service - you might notice random read/write latency if other co-located applications are stressing that service (very, very rarely, but still). But I have not seen any S3-like service on dedicated hardware which can hold 566 billion objects, so let's not say cloud-based shared systems are only for testing quick hacks. Amazon/Google/Microsoft are running their own systems on their cloud platforms :).

    The real question here is whether or not you want to use a particular shared service where you have only API-level control, and that is as much a business decision as a technical one. Do you want to use SQS or build your own queue management? What if you build your product purely on top of SQS, EMR, SNS etc. with a tight dependency on the Amazon cloud - can you then easily move to another cloud system? What's important for you at this point - is it time to market, getting the product out, future scalability/performance, easy portability or some other business metric?

    4. I need Flash Drives and run SSDs - If you think your application's I/O problems can be solved only with new hardware or disks, good luck, because you cannot plug the SSD drive you want into the cloud through a web console anyway. Yes, new hardware will be helpful, and Amazon, Google or Rackspace might be running their systems on 2005/06/07/08 hardware. If you are really worried about running your application on the latest hardware, then don't even think about cloud platforms. Or build your own cloud, but that means doing the undifferentiated heavy lifting at the infrastructure level, which only makes sense if you think that's your core competence or you need it for compliance.

    5. Elastic : Auto Provisioning of Resources - The question is whether you really need this for your application or production. Let's say you are running an e-commerce store for India that sees peak traffic during the daytime and needs extra resources which can be released after 8 PM IST. On dedicated hardware you cannot get this kind of benefit by calling an API unless you build your own cloud on top of it. We manage an e-commerce portal in the cloud and have noticed cost savings. Just read how Netflix or Foursquare leverage the cloud at large scale to save huge costs.
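
    A tiny Python sketch of what such a time-based schedule looks like. Only the 8 PM IST release time comes from the example above; the peak start hour and the instance counts are assumptions for illustration, and the actual call to your cloud provider to resize the fleet is left out.

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))

# Hypothetical numbers for illustration only.
PEAK_START_HOUR = 8      # assumed start of the daytime peak, IST
PEAK_END_HOUR = 20       # 8 PM IST, after which extra capacity is released
BASELINE_INSTANCES = 2   # assumed off-peak fleet size
PEAK_INSTANCES = 6       # assumed daytime fleet size


def desired_instances(now):
    """Return how many instances to run at the given timezone-aware time."""
    hour = now.astimezone(IST).hour
    if PEAK_START_HOUR <= hour < PEAK_END_HOUR:
        return PEAK_INSTANCES
    return BASELINE_INSTANCES


def instance_hours_per_day():
    """Instance-hours per day with the schedule vs. always running at peak size."""
    peak_hours = PEAK_END_HOUR - PEAK_START_HOUR
    scheduled = peak_hours * PEAK_INSTANCES + (24 - peak_hours) * BASELINE_INSTANCES
    always_peak = 24 * PEAK_INSTANCES
    return scheduled, always_peak


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("instances wanted right now:", desired_instances(now))
    scheduled, always_peak = instance_hours_per_day()
    print(f"instance-hours/day: {scheduled} scheduled vs {always_peak} always at peak")
```

    The saving is simply the gap between the two instance-hour totals multiplied by your hourly price, and it only exists if your traffic really has a predictable quiet period.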

    6. Cloud is not "fix all my production problems" Solution - I don't need to write much about this, because there is no 100% solution for every problem in production system management.

    I manage the cloud team at Kuliza, which basically provides managed services on the cloud for customers (earlier we used to do this on physical hardware). I have seen startups running their entire stack (PHP, memcache, MySQL etc.) on one super large box, assuming the cloud will handle crashes and recovery. I am not kidding, so as I said earlier, you still need to design and architect systems for availability, fail-over and reliability. If you have problems in the cloud, then it's most likely your own issue - there are a ton of badly designed deployments on the cloud. Also, I am not touching on security in the cloud for data etc., as that is a big enough topic to discuss separately.

    7. Every Cloud provider is Not Same so is Cloud - Let's accept this fact. There are only one {or two} leaders in public clouds; the same is true for private cloud platform providers, and there are very few reliable hybrid cloud providers. Amazon is leading the public cloud trend and solving some hard problems, while making it easier for its competitors to not repeat the same mistakes (as told by one of their VPs during our discussion at the Kuliza office :D). You really need to choose your provider carefully rather than jumping onto some existing hosting provider who runs some software stack on their hardware and calls it "I also got Cloud for you".

    Enough rants, so what's your take? - I believe that if some cloud is bad, then it's mostly because of our system design and architecture, not really because of some DDR2 RAM or a processor from 2005 :)


    P.S : Screw grammar - I don't care much. Even I used a fancy post title :p

    Update after HackerStreet India discussion : It's a pure myth to think CPU or I/O bound apps cannot run in the cloud. We run processing clusters (EMR & GridGain) for some heavy-duty jobs and they work seamlessly; some of this is legacy processing stuff moved to the cloud last year. We even run the txtWeb platform, which peaks at thousands of requests per minute on 2 nodes and is heavily CPU bound. Look at Netflix: they run some of the most CPU and I/O heavy stuff (media format conversions, compression, optimization, streaming). The issue is whether you can fix I/O on a small node or need a large node for high I/O, and if so, what the cost and design implications are.

    The real issue is that people design stuff without understanding the underlying things, like running a heavy-duty job on a micro node (if you keep hitting high CPU bursts, the machine will become unresponsive) or on a small node where disk I/O for EBS is limited, and expecting it to work like real hardware :)

    ]]>
    http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati
    Wed, 26 Oct 2011 21:02:00 -0700 Sad story of Oracle 10g minor versions - severe bugs and performance impact http://amnigos.com/76213430 http://amnigos.com/76213430

    We deployed Oracle 10g, version 10.2.0.1, in one of our production deployments in the AWS cloud - this was a typical migration from a data center deployment to the cloud. The database size was around 15 GB, with a few tables having 10 to 15+ million rows and a bunch of functions, views and materialized views.

    The moment we rolled out to live production, application users started complaining about slow read/write performance. One of the severe bottlenecks was that we were unable to even compile certain views, refreshing materialized views was taking almost 2X the time compared to the previous production setup, and one MView with a 4 GB dataset was failing to refresh even after 8 hours. Our first target was to review the database configuration parameters - this machine had 15 GB of memory and 8 CPUs, so we tweaked the SGA target, sort area size, DB cache size, shared pool size etc. and increased the tablespace, but those things didn't help. We were pretty sure that the bottleneck was not CPU, memory or disk I/O related.

    But the performance on this instance degraded to a point where we couldn't even create a new user account, as the DB was getting into a hung state. When we looked at the query plans used by Oracle for different select queries, it became clear that it was doing FULL TABLE SCANS on our large tables despite having indexes on the required columns. So we generated stats on the important tables using the DBMS_STATS package (gather_schema_stats and gather_table_stats), but that didn't help in avoiding FULL TABLE SCANS in the query plan selected by Oracle. The impact was severe, as shown by one business user who sent a screenshot where a simple query took 25 minutes to fetch 5K rows.

    So one of our cloud team members (an Oracle DBA) started looking at it and found that the issue was related to two areas - missing stats in the new database (auto stats was not working) and index-related bugs in Oracle 10.2.0.1. Neither generating the required stats on 10.2.0.1 nor importing stats from the old database helped in fixing these issues.

    Finally, we decided to upgrade our database by patching Oracle 10g from 10.2.0.1 to 10.2.0.3, and found that the query plans were now using the proper indexes and that the other issues related to compiling views were also fixed; our application's read/write performance was as expected.

    This whole troubleshooting exercise went on for more than a week, with regular complaints from application users and frustration from business owners - the lesson learned was to be very careful while choosing minor versions of Oracle DB (at least for 10g).

    Hopefully the cost of DBaaS for Oracle on Amazon RDS will come down soon, so everyone can adopt a managed service where you don't need to worry about patching minor versions, scaling, backups etc. I am also looking forward to Oracle's own DBaaS offering. The future of Platform as a Service (be it a database, caching or an actual component for building your application) is very strong :)

    Note : There are some known issues even with 10.2.0.3, but we are sticking to it for the time being while evaluating all recommended patches related to security and performance.

    ]]>
    http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati
    Mon, 24 Oct 2011 23:21:00 -0700 Love LISP - Thank you John McCarthy for GC, AI and RIP http://amnigos.com/love-lisp-thank-you-john-mccarthy-for-gc-ai-a http://amnigos.com/love-lisp-thank-you-john-mccarthy-for-gc-ai-a

    It's sad to hear that John McCarthy, the father of AI and inventor of LISP, has passed away. I started programming with COBOL and C, and discovered LISP while working at Trilogy on the Gensym product - a complex system with AI and NN components used to model real-world expert systems in the defense and manufacturing industries & even space stations for mission-critical monitoring, cause-effect analysis etc. I fell in love with the language because of its simplicity.

    LISP was the first language to have conditional constructs and literals, and whole programs could be written as a bunch of expressions - the beauty of symbols, lists and constructs was pure awesomeness. Before Trilogy acquired Gensym, they even had their own LISP compiler (later ported to Common LISP) and had built a translator to convert LISP code to C/C++ executables for deployment - Gensym also had a language called G2 KB, which is used by application developers to model their business logic using a bunch of symbols, functions and expressions exposed as simple English constructs and via GUI components.

    I still remember how difficult it was initially (yeah, thanks to a few years of work with J2EE and .NET stuff) for me to understand those LISP literals with complex expressions and nested functions - I sometimes used to wonder how such a little piece of code could do so much (recursion was another core beauty). Going through thousands of LOC of LISP in the Gensym core engine and understanding it to fix bugs was fun yet challenging - initially I worked on tuning forward and backward propagation algorithms for cause-effect analysis of events/data points from real-world components/systems, and it was my first real-world use of neural networks. Once LISP became familiar, it was so much fun to work with, modify G2 core behavior and add features like messaging queues and escalation notifications through alarms.

    Thank you, John McCarthy, for introducing garbage collection, LISP and AI to the world. We will miss you. RIP.

    ]]>
    http://files.posterous.com/user_profile_pics/1428978/186957_551746400_7247191_n.jpg http://posterous.com/users/4aQW2ns2vU3v Vijay Rayapati amnigos Vijay Rayapati