Ali P8 architect tells you what is a distributed architecture

Ali P8 architect tells you what is a distributed architecture

I. Introduction

We all know that nowadays, large companies such as BAT, various small companies, and even companies that have just switched to the Internet in traditional industries have begun to use distributed architecture. So what is distributed architecture? What are the benefits of distributed architecture? How has the distributed architecture evolved? Which company started the era of distributed architecture? After reading this article, you will get these answers. Let's start the wonderful journey of distributed overview together!

2. the development history of distributed architecture

On 2.14, 1946, it was a romantic Valentine's Day. The world's first electronic digital computer was born at the University of Pennsylvania, and her name was ENIAC. This computer covers an area of 170 square meters, weighs 30 tons, and can perform 5000 addition operations per second.

After the birth of the first electronic computer, it means that a rapidly changing IT era has arrived. The performance of a single computer has been continuously improved, from the earliest 8-bit CPU to the current 64-bit CPU; from the early MB-level memory to the current GB-level memory; from the slow mechanical storage to the current solid-state SSD hard disk storage.

After ENIAC, electronic computers entered the era of mainframes dominated by IBM. On April 7, 1964, under the leadership of Gene Amdal (the father of IBM mainframes, considered one of the greatest computer designers ever), he spent 5 billion U.S. dollars, which lasted three years and was the first The IBM mainframe SYSTEM/360 was born. This allowed IBM to dominate the entire mainframe computer industry in the 1950s and 1960s, laying the foundation for the IBM computer empire. IBM mainframes have supported the US spaceflight mission to the moon, and IBM mainframes have been serving key areas of core industries such as finance. Due to its super computing power and high reliability, even in the context of the rapid development of X86 and cloud computing, IBM's mainframes still firmly occupy a certain high-end market share.

In the 1980s, in the era of mainframe hegemony, computer architecture developed in two directions at the same time:

  • A personal, inexpensive PC based on CISC (computer language instruction set executed by microprocessor) CPU.

  • A small, expensive, small UNIX server for enterprises based on RISC (Reduced Instruction Set Computer) CPU architecture.

I personally invite major BATJ architects to create a community group of Java architects. (Group number: 673043639) is committed to providing a free Java architecture industry communication platform. Through this platform, everyone can learn from each other, improve technology, and improve their own level. One level, successfully leading to the development of Java architecture technology masters or architects

3. a milestone in the development of distributed architecture

Mainframes rely on the powerful computing and I/O processing capabilities, security, stability, etc. of mainframes. For a long time, mainframes have led the development of the computer industry and the field of commercial computing. The centralized computer system architecture has gradually become the mainstream. However, with the development of society, it is increasingly difficult for this structure to adapt to the needs of enterprises, such as:


 ( ) 

There will be a single point of problem. Once the mainframe fails, the entire system will be in an unusable state. For mainframe users, the loss caused by this unavailability is very significant.

Due to the advancement of technology and technological development, the performance of PCs has been continuously improved, so many companies abandon mainframes and use minicomputers and ordinary PCs to build system architectures.

Alibaba s "Go IOE" campaign opens a new era

IOE refers to the high-end storage of IBM minicomputers, Oracle databases, and EMC. Alibaba s 2009 "Go to IOE" strategy and technical director revealed that as of May 17, 2013, Alibaba s last IBM minicomputer was offline on Alipay.

Why go to IOE?

With the rapid development of business, Alibaba's business volume and data volume have exploded, and the traditional centralized Oracle database architecture has encountered a bottleneck in the scalability of the system. Traditional commercial database software (Oracle, DB2) is mostly based on a centralized architecture. The biggest feature of these traditional database software is that all data is concentrated in one database, and it can only rely on large high-end equipment to provide high processing capabilities and Scalability. The scalability of the centralized database mainly adopts the method of scale up, which improves the processing capacity of the system by increasing the CPU, memory, and disk. This centralized database architecture makes the database the bottleneck of the entire system, and it has become increasingly unable to adapt to the computing power requirements of massive data.

4. the significance of distributed systems

The reason for the development of a distributed system architecture is that the stand-alone system has many shortcomings waiting to be resolved as follows:

1. The cost-effectiveness of upgrading the processing power of a single machine is getting lower and lower. We know that the processing power of a stand-alone machine mainly depends on the CPU, memory, and disk. By upgrading hardware to improve performance in this vertical expansion method, the cost will become higher and higher. The price/performance ratio will get lower and lower.

2. There is a bottleneck in the processing capacity of a single machine, and there is a bottleneck in the processing capacity of a stand-alone machine. CPU, memory, and disk will all have their own performance bottlenecks. Even if you are a local tyrant to upgrade the hardware at the cost, the development speed and performance of the hardware are still limited.

3. It is difficult to reach the two indicators of stability and availability. Finally, there is the problem of availability and stability of the stand-alone system. These two indicators are also problems that we urgently need to solve.

5. common concepts of distributed architecture

1. Cluster

Xiaozhang opened a small restaurant. At the beginning, there was only one chef in the restaurant, who chopped vegetables, washed vegetables, prepared ingredients and cooked all the vegetables. Later, because the rice was sweet and delicious, the flow of people increased. One chef was too busy. Xiao Zhang invited two more chefs. At this time, the three chefs fry the same dishes and cook the same dishes, washing dishes, preparing ingredients and cooking dishes. Waiting for work, the relationship between these three chefs is a cluster. This means that when a customer comes, only one of the chefs will serve the customer.

2. Distributed

After another period of time, the business in the store has become more popular. In order to allow the chefs to concentrate on cooking and make the dishes the ultimate, Zhang also hired a side dish to be responsible for cutting, preparing and preparing the dishes. Then the chef and the The relationship between the chefs is distributed, and later one of the side dishes is too busy, so Xiao Zhang invited two side dishes, and the relationship between the three side dishes is also clustered.

3. Node

A node refers to an individual program that can independently complete a set of logic in accordance with a distributed protocol. In a specific project, a node represents a process on an operating system. Then every side dish and chef here is a node.

4. Copy mechanism

Replica/copy refers to the redundancy provided for data or services in a distributed system. Data copy refers to the persistence of the same data on different nodes. When data is lost on a node, the data can be restored from the copy. Data copy is the only way to solve the problem of data loss in a distributed system. Service copy means that multiple nodes provide the same service, and the solution of high service availability is realized through the master-slave relationship.

5. Middleware

Middleware is located outside the service provided by the operating system, but it is not an application. It is a type of software located between the application and the system layer, which is convenient for developers to handle communication, input and output, and allows users to only care about themselves Applied part.

6. Changes in the von Neumann model in the distributed field

The picture above is the classic theory-von Neumann system. The computer hardware consists of five parts: arithmetic unit, controller, memory, input device, and output device. No matter how the architecture changes, the computer still has not jumped out of the scope of the system.

Input device changes

In the distributed system architecture, input devices can be divided into two categories: the first category is multiple nodes connected to each other, and the information from other nodes is received as the input of the node; the other is the traditional human-computer interaction Enter the device.

Change of output device

In the distributed system architecture, output is also divided into two types. One is when a node in the system transmits information to other nodes, the node can be regarded as an output device; the other is an output device for interpersonal interaction in the traditional sense, such as The user's terminal.

Controller changes

In a single machine, the controller refers to the controller in the CPU. In a distributed system, the main function of the controller is to coordinate or control the actions and behaviors between nodes; such as hardware load balancer; LVS soft load; rule server and many more.

Arithmetic unit

In a distributed system, the arithmetic unit is composed of multiple nodes. Use the computing power of multiple nodes to collaborate to complete the entire computing task.


In a distributed system, we need to organize multiple nodes that are responsible for storage functions to form a whole storage; such as database and redis (key-value storage).

7. the difficulties of distributed systems

There is no doubt that distributed systems are more complicated to implement for centralized systems. Distributed systems will be more difficult to understand, design, build, and manage, and mean that the root cause of the application is harder to find.


In a centralized architecture, there are only two results returned by calling an interface, success or failure. But in a distributed architecture, there will be a "timeout" state.

Distributed transaction

This is actually an old-fashioned question. We all know that transaction is the atomic guarantee of a series of operations. In the case of a single machine, we can easily achieve transaction control by relying on the local database connection and components, but in distributed Under the architecture, business atomic operations are likely to be cross-service, which will lead to distributed transactions. For example, operations A and B are operations in the same transaction under different services. A calls B. If A can clearly know whether B is successfully committed to control its submission or rollback, but we know that in distributed system calls There will be a new state that is timeout, that is, A cannot know whether B succeeded or failed. At this time, A commits the local transaction or performs a rollback? This is actually a difficult problem. If you want to enforce transaction consistency, you can use distributed locks, but that will increase the complexity of the system and increase the overhead of the system, and the more services the transaction spans, the more resources it consumes. Larger, lower performance, then the best solution is to avoid distributed transactions. Another solution is the retry mechanism, but if the retry is not the query interface, it will inevitably involve database changes. If the first call is successful but no successful result is returned, then the second call of the caller is to the caller It is still a retry, but at this time it is a repeated call for the called party. For example, A transfers money to B, A-100, B + 100, which will cause A to deduct 100 and B to increase by 200. This result is not what we expect, so we need to do idempotent design in the interface to be written (multiple calls and single calls have the same effect). You can usually set a unique key to query whether it already exists when writing to avoid repeated writing. But a prerequisite of idempotent design is that the service is highly available, otherwise the call cannot return a clear result no matter how retries, the caller will wait forever. Although the number of retries can be limited, this has entered an abnormal state, even In extreme cases, human flesh compensation is needed. In fact, according to the CAP and BASE theory, it is impossible to achieve consistency in a highly available distributed situation, and it is generally a guarantee of final consistency.

Load balancing

In order to achieve high service availability, at least two machines are deployed for each service. Because Internet companies generally use ordinary machines that are not very reliable, and the probability of long-term operation downtime is high, two machines can greatly reduce the unavailability of services Possibility, and large-scale projects often use a dozen or even hundreds to deploy a service. This is not only to ensure the high availability of the service, but also to improve the QPS of the service, but this brings another problem, one request comes to the end Which machine is routed to? There are many routing algorithms, including DNS routing. If the session is on the local machine, it will be routed to a fixed machine based on user id or cookie information. Of course, the application server will be designed to be stateless for the convenience of expansion, and the session will be saved to the dedicated machine. Some session servers do not generally involve the problem of not being able to get the session. Are the routing rules obtained randomly? This is a method, but as far as I know, the actual situation is definitely more complicated than this. It is random within a certain range, but it will also be divided into many domains in a large range. For example, if there are multiple computer rooms in order to ensure more activity in different places, exaggerate The cost of the computer room call is too large, and the service in the same computer room will definitely be preferred. This should be considered with reference to the specific machine distribution.


The data is scattered or copied to different machines, how to ensure the data consistency between the hosts will become a difficult point.

Independence of failure

A distributed system is composed of multiple nodes. There is a possibility that the entire distributed system has problems completely, but in practice, it is more that one node has problems, and other nodes are okay. In this case, we need to consider more comprehensively when implementing a distributed system.

Written at the end: welcome to leave a message to discuss, pay more attention, and continue to update! ! !

To- Java