Wednesday, May 21, 2014

Indian Railways: What is the basic reason for the IRCTC website to be so slow and unpleasant to use?

Contrary to popular belief, IRCTC's slowness probably does not come down to a single cause; it is a compound problem.

Sheer volumes and complexity 
  • They are the biggest e-commerce site in the country, with about 2x the net sales of Flipkart.com, the next name on that list AFAIK. The demand they see in the peak tatkal booking hours is, I suspect, orders of magnitude higher than what Flipkart can handle. For instance, something as small (relatively) as the recent Moto E launch in India took down Flipkart's infrastructure (Flipkart server crashes as Motorola Moto E goes on sale).
  • Over 8 billion trips per year are undertaken on Indian Railways, and the bulk of the booking happens through this system. Even on a record-breaking day, Amazon only sold 13.5 million items worldwide, whereas Indian Railways as a whole sells more than 21 million tickets on an average day. The electronic system was put in place in the 80s and has since received multiple incremental updates.
  • Of these, IRCTC books around 250 million tickets per annum. It operates just the outermost web layer. So, while webserver configurations, website architecture, and other improvements will play a role, problems with booking a ticket are spread across every layer down to the database. Moreover, the failures at each stage of booking compound, because a person who fails at the last step will start afresh right from login.
  • The problem is not just the insufficiently responsive ticketing interface, but the huge demand-to-supply ratio. Even if the whole process were totally seamless, we would’ve only ensured that the first 20 million or so people got their tickets every day; 3 times as many people would still be left without a reservation.

Bottlenecks and solutions

Technical - 

Seat selection
  • The railway system is also far more complex than, say, a straightforward purchase from Amazon or a Google search. Tickets are commodities, while seats and berths are not! This adds to the complexity. Let’s say Amazon had an inventory of a million toothbrushes and 2 million people tried to buy the same item. I’m assuming they’ll just need to decrement the counter and complete the transaction with buyers on a first-come, first-served basis, without worrying about which exact toothbrush will go to whom (at the time of sale). In contrast, Indian Railways offers a precisely labelled berth or seat to a ticket buyer.
  • In database terms, a row-level lock is obtained on the berth while thousands of transactions compete for it! A better way would be to sell tickets first and allot the berths later. If passengers can handle the sophistication, they could even do a check-in later, but I think we’re still quite some time away from that. The current logic for berth preference can be retained, but applied as a separate step. In reservation centres, there can be a separate berth allotment counter for confirmed ticket holders.
  • Berth allotment could use a stateful precompute algorithm that keeps the next candidate seat ready and hands it out on request, then generates the following one in a background async process.
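
To make the "sell tickets first, allot berths later" idea concrete, here is a minimal in-memory sketch. It is not how IRCTC works; the class `TicketCounter`, its method names, and the berth labels are all hypothetical, and a real system would keep this state in a database rather than in one process. The point is only that the contended step shrinks to a counter decrement, while mapping tickets to labelled berths happens in a separate pass.

```python
import threading


class TicketCounter:
    """Hypothetical sketch: sell fungible tickets now, assign berths later."""

    def __init__(self, berths):
        self._lock = threading.Lock()
        self._berths = list(berths)      # labelled berths, in allotment order
        self._remaining = len(self._berths)
        self._sold = []                  # (ticket_id, passenger) in sale order

    def sell(self, passenger):
        """Contended step: decrement the counter, no per-berth lock needed."""
        with self._lock:
            if self._remaining == 0:
                return None              # sold out
            self._remaining -= 1
            ticket_id = len(self._sold) + 1
            self._sold.append((ticket_id, passenger))
            return ticket_id

    def allot_berths(self):
        """Separate step: map each sold ticket to a labelled berth."""
        with self._lock:
            return {
                ticket_id: berth
                for (ticket_id, _), berth in zip(self._sold, self._berths)
            }
```

Here berths are handed out in sale order; the existing berth-preference logic could replace that ordering inside `allot_berths` without touching the sale path.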
Payments
  • Another huge bottleneck is in processing payments. It involves a series of menu selections by users, each causing a page load, followed by complex handshakes between IRCTC, third-party gateways, and the banks, plus security checks and so on. Each step is prone to failure too. On the whole, 29% of attempted payments failed.
  • Why not deduct money before ticket booking begins for the day and return it if booking is unsuccessful? Actually, IRCTC is instead considering a smarter move by which passengers can keep prepaid cash with IRCTC. Apart from the obvious performance improvement, the economic implications are huge. Imagine crores of rupees lying with them without any need to pay interest. Already, Indian Railways benefits from having an Advance Reservation Period of 4 months. Together, effectively, passenger money is deposited several months in advance. Remember Dell?
  • Possibly they could keep credit cards on file. Google Play, iTunes and many others already do this; your next ticket could be a couple of clicks away, EVEN ON MOBILE!
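
The prepaid idea above can be sketched as a wallet that places a hold before booking and either captures or releases it afterwards. This is an illustration under my own assumptions, not IRCTC's design: `PrepaidWallet` and its methods are hypothetical names, and amounts are integers in paise to avoid floating-point rounding.

```python
class PrepaidWallet:
    """Hypothetical sketch of hold / capture / release on a prepaid balance."""

    def __init__(self, balance_paise):
        self.balance = balance_paise  # money the passenger can still spend
        self.held = 0                 # money reserved for in-flight bookings

    def hold(self, amount):
        """Reserve the fare before the booking attempt starts."""
        if amount > self.balance:
            return False
        self.balance -= amount
        self.held += amount
        return True

    def capture(self, amount):
        """Booking succeeded: the held money is spent."""
        if amount > self.held:
            raise ValueError("cannot capture more than is held")
        self.held -= amount

    def release(self, amount):
        """Booking failed: return the held money to the balance."""
        if amount > self.held:
            raise ValueError("cannot release more than is held")
        self.held -= amount
        self.balance += amount
```

With something like this, the gateway and bank handshakes move out of the peak booking window, since the only payment work at booking time is local bookkeeping.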
Database
  • It becomes apparent that a lot of this has to do with their database architecture; they could perhaps implement a beefier database cluster with ample replication and sharding. A lot of the work is read-bound, and this could most certainly be improved.
  • If the database and application layers are appropriately coupled, it should be possible to implement some sort of active caching, minimising the impact of rapid reads without compromising data validity.
  • A lot of the sporadic load comes from tatkal tickets. This is a tiny fraction of the huge data open to booking. A good architecture would be to shard this critical hot data into a highly optimised database, tuned for concurrent locks and writes and offering high availability for read throughput. Also tune your application to handle this separately, optimised for the job at hand. Use beefy hardware for it (plenty of RAM, CPU, network bandwidth, etc.).
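
As a rough illustration of the caching point above, here is a small read-through cache with a short time-to-live and explicit invalidation. Everything in it is an assumption made for the example: the class name `AvailabilityCache`, the `loader` callback standing in for a database query, and the 2-second TTL. It only shows how repeated availability reads can be absorbed while a successful booking invalidates the stale entry.

```python
import time


class AvailabilityCache:
    """Hypothetical read-through cache for seat-availability lookups."""

    def __init__(self, loader, ttl_seconds=2.0, clock=time.monotonic):
        self._loader = loader    # stands in for the real database query
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # fresh hit: no database read
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after a write (e.g. a booking) so readers see fresh data."""
        self._entries.pop(key, None)
```

The short TTL bounds how stale a read can be, and `invalidate` is where the coupling between the application and database layers would show up in practice.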

Application Design and Architecture

  • Toss out the monolithic web app and validation system
  • Using a CMS at this scale (IRCTC uses BroadVision’s CMS) is an epic fail.
  • Ditch the Windows stack and move to Unix-based environments. Enable gzip compression, and use stuff like Flashcache, HAProxy, and Docker deployments that scale dynamically depending on the time of day. For instance, add 5x the power during peak. Use a combination of on-demand cloud and self-hosted infra for best results.
  • Remove banner adverts; the railways should not need them. Keep a clean, minimal interface that does the job.
  • Move session handling into a separate layer, where it can be handled much more gracefully. The load of tracking the 10-minute timeout in a main loop, for instance, is perhaps one of the main reasons for many of the screwups.
  • Break up critical parts into modules that are designed to scale horizontally given the right hardware.
  • Use a CDN for static content, with multiple hosts and redundant bandwidth across locations in the country.
  • Revamp the core API to be data-driven and offload client logic to the web browser using something like Backbone or Angular. This achieves two things: the API can be rigorously tested and hardened, and the UI can be handed to another team that specializes in it. The result is a better UI that is easier to extend, and it opens up possibilities for a much better API-driven mobile client.
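
On the session-handling point above, one way to avoid timing out sessions in a main loop is to store an expiry time with each session and check it lazily on access. The sketch below assumes exactly that; `SessionStore` is a hypothetical name, the 600-second value mirrors the 10-minute timeout mentioned above, and the sliding renewal on each access is my own choice. In a real deployment this layer would more likely be an external key-value store than a Python dict.

```python
import time

SESSION_TTL_SECONDS = 600  # the 10-minute timeout


class SessionStore:
    """Hypothetical session layer with lazy expiry instead of a timer loop."""

    def __init__(self, ttl=SESSION_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock      # injectable so tests can control time
        self._sessions = {}      # session_id -> (expires_at, data)

    def put(self, session_id, data):
        self._sessions[session_id] = (self._clock() + self._ttl, data)

    def get(self, session_id):
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        now = self._clock()
        if expires_at <= now:
            del self._sessions[session_id]   # expired: drop on access
            return None
        # Sliding expiry: each access renews the session.
        self._sessions[session_id] = (now + self._ttl, data)
        return data
```

No background loop ever scans live sessions here; the cost of expiry is paid only when a session is actually looked up.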
Hardware & Networking

  • Upgrade the database cluster for sure, along with most of the architecture that can no longer support these workloads.
  • Add edge points in all major cities and suitably geo-located centres to minimise the response time (this would need a decoupled app IMHO for best results).
  • Sorry, EC2 is no magic word, whatever a lot of people suggest. EC2 would be a poor choice for IRCTC given where their traffic and business are based.
  • Perhaps at certain times use a hybrid cloud deployment; you can use the elasticity of the cloud to meet peak requirements that are sporadic. If the demand is consistently at similar levels, add new physical hardware to your DC, not to the cloud.
  • Add a Peakflow/Arbor device to handle DDoS mitigation, as I suspect the railways site is a victim of plenty of attacks given its importance. I would choose a device over other techniques, as the 2x latency introduced by routing through mitigators/scrubbing centers (none in India) would just defeat the purpose.
  • Dedicated redundant bandwidth from multiple ISPs, possibly multihomed. They always seem to choke on this; besides, DDoS mitigation requires plenty of it. They are a sizable organization and should be able to get that from govt and non-govt players.
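
The hybrid deployment suggestion above boils down to simple arithmetic: serve the baseline from owned hardware and rent only the shortfall at peak. The sketch below shows that calculation; the function name `plan_capacity`, its parameters, and the demand figures in the usage line are invented for illustration and are not real IRCTC numbers.

```python
import math


def plan_capacity(hourly_demand, owned_servers, per_server_capacity):
    """Return, per hour, how many cloud servers are needed beyond owned ones.

    hourly_demand: requests per second expected in each hour (assumed input)
    owned_servers: servers already in the data centre
    per_server_capacity: requests per second one server can handle
    """
    plan = []
    for demand in hourly_demand:
        needed = math.ceil(demand / per_server_capacity)
        plan.append(max(0, needed - owned_servers))
    return plan


# Made-up demand for three hours: quiet, tatkal peak, quiet again.
burst = plan_capacity([100, 500, 120], owned_servers=2, per_server_capacity=100)
```

If the resulting list is non-zero for most hours, that is the "consistent demand" case, where buying physical hardware makes more sense than renting.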


NON TECHNICAL

Yes, IRCTC is at the end of the day a public sector body; here are some of the usual suspects:
  • Bureaucracy in every step - delays getting the required people, hardware and technical expertise to fix the issue once and for all
  • Lack of political will to fix the IRCTC system, which creates a bias in favour of the chap at the counter.
  • Possible corruption that benefits from a bad IRCTC experience, such as the network of travel operators and agents with political connections/lobbies that ensure they stay in business.
  • Lack of funds
  • The top-down culture could have put in place people who are averse to change and lack the required technical inclination, so this may not feature on their priority list, or they could be fooled by the software vendor into accepting substandard software such as a CMS running on a Windows stack.
