

Evolution of the CDN static architecture of Tmall browsing applications

Posted by loope at 2020-03-31

Abstract: The growth of Double 11 traffic brings technical challenges such as capacity evaluation, hardware expansion and performance optimization to the Tmall browsing systems. This article explains how to address these problems by separating static and dynamic information through staticization, storing static content in caches, and loading and filling in a small amount of dynamic data asynchronously.

During Tmall's Double 11 promotion, browsing systems such as product details and stores usually bear traffic several times, or even dozens of times, higher than on a normal day. As Double 11 traffic has grown rapidly over the years, these browsing systems have faced technical challenges such as capacity evaluation, hardware expansion and performance optimization. The architecture therefore focuses on how to handle sharp peaks of requests at a reasonable cost while ensuring scalable system capacity, stable user response times, and high availability when external dependent systems fail. In addition, as the systems that carry the most important page traffic, they must meet security and stability requirements such as protection against crawler attacks, flow control and disaster recovery, and must weigh network bandwidth, hardware cost, cache efficiency and other factors to find a balance point, so that a stable architecture can cope with ever-changing demands.


Therefore, since 2011, with the Tmall product details system as the representative, one of the main architectural tasks for the Tmall browsing systems has been to separate static and dynamic information through staticization, store the static content in caches, and load and fill in the small amount of dynamic data asynchronously. Before Double 11 in 2013, the process went through three stages (as shown in Figure 1), from single-machine staticization to unified cache access and finally to full CDN staticization, effectively solving key problems such as cache hit rate, natural traffic distribution, simplified system scaling, and client response speed.

Figure 1. Three stages of CDN staticization

At present, the latest CDN-based static architecture used by the Tmall browsing systems meets the original expectations of high availability and continuous scalability, and includes the following features.

This article reviews that optimization process: the main technical challenges, the architecture transformation strategies and the final results. It focuses on the evolution of the overall architecture, the cache invalidation mechanism, dynamic content filling and other specific points of the move to the CDN.

First stage: system staticization

In the early days, the Tmall browsing systems mostly used a simple architecture with a very thin front-end application. Taking Tmall's product details system as an example, for heavily accessed data-center interfaces such as products and users, the application's client-side cache was changed into a front (pre-positioned) cache, and page caching was widely used to reduce pressure on the back-end systems, so that horizontal scaling of the application as a whole was not limited. The main problems and challenges faced by the system at this stage include the following.

Given these problems, optimization based on the original dynamic browsing system inevitably runs into bottlenecks, such as the following points.

On the whole, the solution has to start from the architecture. Architecture optimization considers the following three aspects.

Therefore, since 2012, the transformation project for the dynamic browsing system has been officially under way to solve the above problems through staticization. That is, based on the business, the content of the original dynamic system is separated into static and dynamic parts, the part that does not depend on the browsing user is cached, and the dynamic content is filled in through CSI (client-side include). The design focuses on three aspects: static and dynamic separation, the static cache mode, and the cache invalidation mechanism. Figure 2 shows the overall static architecture of the first stage.

Figure 2. Overall static architecture of the first stage

Static and dynamic separation

The content of the original page is divided by business. Content that is relatively public, changes infrequently, and does not depend on dimensions such as the browsing user, the information publisher, time, region or private information (cookies and so on) is used as the basis for generating static content. After staticization, the page URL is fixed, different URLs represent different content, and the response returned by the server depends only on the URL; the remaining dynamic content is fetched through asynchronous interfaces and filled in through CSI. Taking the product details system as an example, basic product information such as the title, product description and sales attribute combinations is stored directly in the cache after staticization, while dynamic information such as discounts, inventory, logistics and services is filled into the static page frame through asynchronous calls.
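The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the in-memory `page_cache` dictionary and the `load_product` callback are hypothetical and are not taken from the Tmall implementation.

```python
# Hypothetical field names, based on the examples given in the text.
STATIC_FIELDS = ("title", "description", "sales_attributes")
DYNAMIC_FIELDS = ("discount", "inventory", "logistics", "service")

page_cache = {}  # URL -> static HTML frame (stands in for the real cache)


def render_static_frame(product):
    """Build the cacheable frame: static fields plus empty slots for CSI."""
    attributes = "".join(f"<li>{a}</li>" for a in product["sales_attributes"])
    slots = "".join(f'<div id="{name}"></div>' for name in DYNAMIC_FIELDS)
    return (
        f"<h1>{product['title']}</h1>"
        f"<p>{product['description']}</p>"
        f"<ul>{attributes}</ul>"
        f"{slots}"
    )


def get_static_page(url, load_product):
    """The response depends only on the URL, so the URL is the cache key."""
    if url not in page_cache:
        page_cache[url] = render_static_frame(load_product(url))
    return page_cache[url]


def get_dynamic_data(product):
    """Payload of the asynchronous interface that the browser calls later."""
    return {name: product[name] for name in DYNAMIC_FIELDS}
```

In this sketch a change in inventory never touches the cached frame; only the asynchronous payload changes, which is what keeps the hit rate of the static part high.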

Cache mode

Overall, caching is divided into four tiers: application server, web server, CDN node and client browser (as shown in Figure 3), each serving a different purpose.

Figure 3. Overall cache tiers

For the cache system, considering development cost, stability and I/O performance, Tair, a distributed key/value system widely used at Alibaba, was selected to store the static pages. Compared with nginx's local disk cache, a local Tair instance has better read/write performance, less impact on server response time and load fluctuation, and lower usage and maintenance costs. The whole system is detailed as follows.

Cache invalidation

Cache invalidation mainly includes two mechanisms: active invalidation through the invalidation backend, and automatic invalidation when a cache entry expires. For active invalidation, the main technical difficulties include the following three aspects.

Taking the product details system as an example, the invalidation sources are mainly product data and shop decoration information. When a back-office user changes the corresponding content, the invalidation backend is notified through the message mechanism. The invalidation backend receives the message, records the ID of the product to be invalidated, and invalidates the cache by calling the local Tair interface. The general process is shown in Figure 4.

Figure 4. Cache invalidation process
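The two mechanisms, expiry-based and message-driven invalidation, can be sketched as follows. This is a minimal in-memory illustration, not the Tair API: the `PageCache` class, the `item:<id>` key format and the message shape are all assumptions made for the example.

```python
import time


class PageCache:
    """In-memory stand-in for the page cache, with a fixed time to live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, html)

    def put(self, key, html):
        self._entries[key] = (self.clock() + self.ttl, html)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, html = entry
        if self.clock() >= expires_at:
            # Automatic invalidation: the entry has outlived its TTL.
            del self._entries[key]
            return None
        return html

    def invalidate(self, key):
        """Active invalidation; returns True if an entry was removed."""
        return self._entries.pop(key, None) is not None


def handle_change_message(cache, message):
    """Invalidation backend: map a change message to a cache key and purge it.

    The message shape {"item_id": ...} is hypothetical.
    """
    return cache.invalidate(f"item:{message['item_id']}")
```

Injecting the clock keeps the expiry path testable without sleeping; a real deployment would rely on the expiry support of the cache itself.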

Transformation results

Taking Tmall's product details system as an example, after the static architecture was adopted and combined with later optimizations such as separating out shop decoration, the results for Double 11 in November 2012 were as follows. In terms of performance, with an 80% cache hit rate on a single (physical) machine, the safe QPS (queries per second) per machine was more than 7 times that of the same period in 2011, while system resource usage was less than 50% of the original. Staticization also solved the problem of hotspot attacks on a single URL. More importantly, it turned the strong dependency on the back-end Java systems under the original dynamic architecture into a weak dependency: on the one hand, the static cache layer protects the back-end systems to a certain extent; on the other hand, in the extreme case where the back-end systems are unavailable, part of the traffic can still be served from the cache.

Second stage: unified web cache

In the first stage, the static transformation based on product details achieved good results. After Tmall's product details system took the lead in completing the transformation, browsing systems such as stores quickly completed their own restructuring using similar schemes. In the process, static technical specifications were gradually established to simplify onboarding. At the same time, it was found that although these systems all serve browsing scenarios, differences in the details of their cache schemes led to some common problems with the static cache system, including the following points.

It was therefore natural to unify access to the web cache layer and share a static cluster in order to save cost and improve stability and hit rate. From the perspective of operation and maintenance:

Building a unified access layer requires some changes to each browsing system. At the architecture level, the technical problems to be solved mainly involve the following parts.

Cache system selection

In the first stage, each browsing system used a single-machine cache, with slight differences driven by cost, business scenario and other factors. A unified access layer must accommodate the special requirements of each browsing system: it must support standard ESI (Edge Side Includes) parsing and gzip compression in ESI mode, so that partial dynamic content on static pages can be filled in on the server side; it must meet the QPS (queries per second) requirements under Double 11 / Double 12 traffic pressure; and it must support an invalidation protocol over long-lived connections with batch invalidation. Based on this analysis, and considering that static content would eventually be deployed on the CDN, the cache software finally chosen for the unified access layer supports all of the above, including fast invalidation and preheating, combining CSS and JavaScript files, long-lived connections and batch invalidation, and programmable configuration based on HTTP headers.
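Server-side filling of partial dynamic content through ESI, followed by gzip compression, can be illustrated with the sketch below. It handles only the self-closing `<esi:include src="..."/>` form and is not a full ESI processor; the `fetch_fragment` callback and the function names are assumptions made for this example, not part of the cache software the article refers to.

```python
import gzip
import re

# Matches only the simple self-closing form <esi:include src="..."/>.
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def fill_esi(cached_html, fetch_fragment):
    """Replace each ESI include tag with the fragment fetched for its src."""
    return ESI_INCLUDE.sub(lambda m: fetch_fragment(m.group(1)), cached_html)


def respond(cached_html, fetch_fragment):
    """Fill the page first, then compress the assembled result."""
    filled = fill_esi(cached_html, fetch_fragment)
    return gzip.compress(filled.encode("utf-8"))
```

Compression has to happen after assembly: the cached frame must stay uncompressed (or be decompressed first) so that the tags can still be found and replaced.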

Unified invalidation mechanism

Corresponding to the change of cache software, each browsing system connected to the unified cache must adapt its original invalidation mechanism to the new cache system and protocol, using a common protocol standard to actively invalidate objects both in batches and individually. At the same time, a unified invalidation center and a cache verification layer were established, and all active invalidation requests from connected applications are executed as purges through the invalidation center. At the level of the underlying invalidation sources, changes to the source data are monitored. For example, when a product is edited, including updates to its title or description, the detail page must be invalidated; active invalidation is driven by real-time monitoring and the message mechanism (as shown in Figure 5).

Figure 5. Active invalidation based on real-time monitoring and the message mechanism

Transformation of web server

The web server layer in front of the cache layer must support consistent-hash grouping, integrate the session framework used by the existing systems, and support dynamic configuration of domain-name-based virtual hosts. For this purpose, colleagues from the core system department developed Tengine, Taobao's customized version of the nginx server, which is deployed as the web server layer of the unified access layer.
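Consistent hashing is what lets the web server layer route the same URL to the same cache node while remapping only a small share of keys when a node is added or removed. The sketch below shows the general technique with virtual nodes; it is not Tengine's implementation, and the class name, node names and replica count are illustrative assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes; a key maps to the next point clockwise."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Each node owns `replicas` points to even out the distribution.
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        index = bisect.bisect(self._ring, (self._hash(key), ""))
        if index == len(self._ring):
            index = 0  # wrap around to the first point on the ring
        return self._ring[index][1]
```

When a cache node is removed, only the URLs that were mapped to it move elsewhere; every other URL keeps its node, so the rest of the cached content stays warm.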

Network traffic support

After access to the cache layer is unified, the cached data of every system is centralized and so is the traffic. At the network deployment level, 10-gigabit network cards can be used to remove the hardware bottleneck. In addition, the outbound network traffic that the cluster must support is evaluated to ensure that neither the internal nor the external links of the data center become a bottleneck, and that the internal traffic generated by back-to-source requests to the origin servers when cached content is invalidated can be supported.

Overall deployment plan

Figure 6 shows the overall deployment plan, from which we can see:

Figure 6. Overall deployment plan

Transformation results

In the first half of 2013, the unified access layer was completed and browsing systems such as product details began to connect to it. Afterwards, a centralized cache layer sat on top of the original single-machine cache, solving the problem of scaling the cache layer horizontally. The 10-gigabit network cards effectively removed the network bottleneck of the cache layer. Because the unified access layer is independent of any application, it can be shared by multiple applications, which greatly reduces monitoring and maintenance costs and improves quality and efficiency. Of course, this transformation also makes applications strongly dependent on the cache layer link, and the cache layer itself becomes a single point of failure. Moving from single-machine static caching to the unified access layer is only half the journey. The ultimate goal of the whole transformation is to use the CDN's distributed, regional deployment and its large traffic capacity to achieve CDN staticization of the browsing applications.

Third stage: CDN staticization

The unified access layer solved the low memory utilization of the single-machine cache and removed its memory size constraint. It also solved the problem that, as the number of products grew and hot products became more dispersed, the cache could only be scaled vertically and not horizontally, which improved the maintainability and scalability of the cache system. After the move from single-machine static caching to the unified access layer was complete, static pages were placed on the CDN. The CDN provides greater serving capacity on the nodes closest to users, which makes it the ideal architecture for unitizing the cache system. It also provides a more reliable and stable guarantee against Double 11 peak traffic and attacks.

CDN involves three specific technical difficulties.

Overall framework

Based on the above ideas, the overall architecture is relatively clear. The transformation covers three aspects: the cache system, the invalidation mode, and dynamic content filling. The overall architecture is shown in Figure 7.

Figure 7. Overall static architecture

Cache system

Both the unified access layer and the CDN nodes use the web server + cache mode. The domain name of a static application resolves to the virtual IPs of both the CDN and the unified access layer. When the CDN receives a request, it first reads its local cache; on a miss, it goes back to the unified access layer. The unified access layer processes the request with its original logic: on a miss in its own cache, it goes back to the source and fetches the data from the application server. The web server of the unified access layer must also be able to tell whether a request is a CDN back-to-source request or a normal user request, so as to avoid duplicate access logging and duplicate gzip compression.
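The two-level lookup and the back-to-source check can be sketched as below. This is a simplified model under stated assumptions: the `X-Back-To-Source` header name, the class names and the in-memory dictionaries are hypothetical, and the article does not say how the real web server recognizes CDN requests.

```python
import gzip

CDN_HEADER = "X-Back-To-Source"  # hypothetical marker set by CDN nodes


class UnifiedAccessLayer:
    def __init__(self, origin):
        self.origin = origin  # callable: url -> html from the app server
        self.cache = {}
        self.access_log = []

    def handle(self, url, headers=None):
        from_cdn = (headers or {}).get(CDN_HEADER) == "cdn"
        if url not in self.cache:
            self.cache[url] = self.origin(url)  # miss: go back to the source
        body = self.cache[url]
        if from_cdn:
            # The CDN node logs and compresses; doing it here would duplicate both.
            return body
        self.access_log.append(url)
        return gzip.compress(body.encode("utf-8"))


class CdnNode:
    def __init__(self, access_layer):
        self.access_layer = access_layer
        self.cache = {}
        self.access_log = []

    def handle(self, url):
        self.access_log.append(url)
        if url not in self.cache:
            # Local miss: go back to the unified access layer, marked as CDN.
            self.cache[url] = self.access_layer.handle(url, {CDN_HEADER: "cdn"})
        return gzip.compress(self.cache[url].encode("utf-8"))
```

A user request that lands on the CDN is logged and compressed once, at the edge, and the application server sees at most one request per URL until the entry is invalidated.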

Cache invalidation

The principle of cache invalidation is similar to that of the unified access layer. The invalidation process is roughly as follows: the client request is randomly assigned through the VIP to a node in the invalidation center, the invalidation task is then sent to the agent, and the agent sends the invalidation command to the cache server and returns the result, as shown in Figure 8.

Figure 8. Cache invalidation principle
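The invalidation path of VIP, invalidation center node, agent and cache server can be modeled roughly as follows. All class names are hypothetical, and the assumption that every center node reaches every agent is made only for this sketch; the article does not describe the actual topology.

```python
import random


class CacheServer:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def purge(self, key):
        """Remove one key; returns True if it was cached."""
        return self.store.pop(key, None) is not None


class Agent:
    """Sends invalidation commands to its cache servers and reports results."""

    def __init__(self, servers):
        self.servers = servers

    def execute(self, keys):
        return {s.name: [k for k in keys if s.purge(k)] for s in self.servers}


class CenterNode:
    def __init__(self, name, agents):
        self.name = name
        self.agents = agents

    def submit(self, keys):
        result = {}
        for agent in self.agents:
            result.update(agent.execute(keys))
        return result


class InvalidationVip:
    """Stands in for the VIP: picks one center node at random per request."""

    def __init__(self, nodes, rng=None):
        self.nodes = nodes
        self.rng = rng or random.Random()

    def submit(self, keys):
        node = self.rng.choice(self.nodes)
        return node.name, node.submit(keys)
```

Because any center node can serve any request, the center can be scaled out or lose a node without the calling applications noticing.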

Dynamic content fill

On the business side, because parts of a page need to be switched on a schedule, ESI and page management were added to the overall architecture as dynamic content filling methods. The cache layer is responsible for parsing ESI tags, going back to the source for them, and caching the ESI requests, and provides the following features.

Transformation results

Finally, the CDN-based static architecture removes the horizontal scaling bottleneck of the single-machine cache. With a higher hit rate and larger system capacity, peak traffic can be supported at lower cost. The ESI programming model was introduced to solve the problem of refreshing part of a page, which supports special Double 11 requirements for switching page content across the whole network at a scheduled time. Static pages plus the weak-dependency transformation bring high availability, and the work ultimately produced a cache and invalidation system that is independent of any application. On November 11, 2013, with this CDN static architecture, Tmall's browsing systems, such as product details, passed a record-setting day smoothly. Both page views (PV) and peak page requests (QPS) reached new highs, while the systems themselves remained very stable and had enough headroom to bear even higher traffic. At the same time, the new deployment model and the cache system built on the regional distribution of CDN nodes also reduced the peak impact of per-second request bursts and better met the stability requirements. In the future, browsing systems similar to Tmall's can complete static transformation and onboarding easily by following this architecture, and achieve their stability and scalability goals.

About the author: Xu Zhao (nickname Changgong) is mainly responsible for optimizing the architecture of Tmall's detail system. A graduate of Zhejiang University's computer science program, Xu loves Java web technology, pays close attention to server performance optimization, and enjoys researching and sharing open source technology.

This is an original article from Programmer magazine and may not be reproduced without permission. For reprints, please contact market#csdn.net (replace # with @).