The general symptom was this:
An OpenAPI interface we provide responds inconsistently: tens of milliseconds when it is fast, a few seconds when it is slow.
1. Trying to solve it
Since the slowness was not tied to any particular business operation, it could not be located directly, so we tried to reproduce it in the test environment. Unfortunately, the test environment was consistently fast.
There was no other way but to bite the bullet and dig in.
Halfway through, I tried my luck and asked the ops team to check the OpenAPI response times recorded by Nginx, hoping to blame the network. That backfired: the Nginx logs confirmed that the response times really were a problem.
In order to understand this problem clearly, I briefly sorted out the calling process:
The entire process can be regarded as a relatively common layered architecture:
The client sends a request to Nginx;
Nginx load-balances the request to the back-end web service;
the web service calls the back-end Service application over RPC.
1. The logging approach
The first idea was logging: record the elapsed time at every method or interface that might be slow, to pin down where the problem is.
But given the call chain above, the request path is not short. Adding logs would touch many places, and if we missed one the problem might still not be located; changing the code would also mean another release and deployment.
2. Tool analysis
The best approach, then, is to analyze the problem without changing a single line of code, which calls for an agent-based tool. We chose TProfiler, previously open-sourced by Alibaba.
You only need to add -javaagent:/xx/tprofiler.jar to the JVM startup parameters to monitor the methods you care about and have reports generated. It is very convenient, completely non-intrusive to the code, and its performance overhead is relatively small.
2. Using the tool
Let's briefly show how to use this tool:
The first step is naturally to clone the source code and package it; you can clone the fork that I modified.
Because Ali has not maintained this project for years and some bugs remain, I fixed a bug that prevented normal use and made some optimizations on top of the original.
Just execute the following commands:
git clone https://github.com/crossoverJie/TProfiler
cd TProfiler
mvn assembly:assembly
After that, the jar we need is generated at TProfiler/pkg/TProfiler/lib/tprofiler-1.0.1.jar within the project.
Next, you only need to add the jar to the startup parameters and point it at a configuration file.
The meaning of each option in this configuration file is explained in the official documentation.
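As a concrete reference, a minimal profile.properties might look like the sketch below. The key names follow my recollection of the TProfiler README, and com.example is a placeholder package; check the sample file shipped with the project before relying on any of them.

```properties
# Sketch of a minimal TProfiler profile.properties (verify key names
# against the sample file in the repository).

# Time window during which profiling is active (HH:mm:ss)
startProfTime=1:00:00
endProfTime=23:00:00

# Output files: per-invocation records, method index, sampler output
logFilePath=${user.home}/logs/tprofiler.log
methodFilePath=${user.home}/logs/tmethod.log
samplerFilePath=${user.home}/logs/tsampler.log

# Port of the agent's socket service (used by TProfilerClient)
port=50000

# Only instrument classes under these packages (placeholder value)
includePackageStartsWith=com.example
```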
The final startup parameters then look something like this (paths are placeholders): -javaagent:/xxx/tprofiler.jar -Dprofile.properties=/xxx/profile.properties
To simulate a slow interface to troubleshoot, I implemented an HTTP interface with cicada that calls two time-consuming methods:
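The original cicada handler is not shown here, so as a stand-in here is a plain-Java sketch of two artificially slow methods for the profiler to record. selectDB() is the method named later in the post; selectRedis() and the sleep durations are my invention.

```java
import java.util.concurrent.ThreadLocalRandom;

public class SlowApi {

    // Simulates a database query taking roughly 100-300 ms (made-up range).
    static String selectDB() {
        sleepQuietly(ThreadLocalRandom.current().nextLong(100, 300));
        return "db-result";
    }

    // Simulates a cache lookup taking roughly 50-100 ms (hypothetical method).
    static String selectRedis() {
        sleepQuietly(ThreadLocalRandom.current().nextLong(50, 100));
        return "redis-result";
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        selectDB();
        selectRedis();
        System.out.println("total=" + (System.currentTimeMillis() - start) + "ms");
    }
}
```

With the agent attached, each call to these methods would show up as one record in the profiler's log.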
This way, once the application starts, TProfiler records the method data it collects into the directory I configured.
After I hit the interface a few times, it writes the detailed timings of each method invocation to tprofiler.log:
From left to right, the columns are:
thread ID, method stack depth, method number, and elapsed time (milliseconds).
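To make that column layout concrete, here is a tiny parser for one such record. The sample line is made up, following the four columns just described, not copied from a real log.

```java
public class ProfLine {
    final long threadId;
    final int depth;
    final int methodId;
    final long costMs;

    // Columns: thread ID, stack depth, method number, cost in milliseconds
    ProfLine(String line) {
        String[] c = line.trim().split("\\s+");
        threadId = Long.parseLong(c[0]);
        depth = Integer.parseInt(c[1]);
        methodId = Integer.parseInt(c[2]);
        costMs = Long.parseLong(c[3]);
    }

    public static void main(String[] args) {
        ProfLine p = new ProfLine("60 1 2 205"); // hypothetical record
        System.out.println("method " + p.methodId + " took " + p.costMs + "ms");
    }
}
```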
But tmethod.log is still empty at this point. We only need to run the following command to flush the latest method sampling data into the tmethod.log file (run the client class with the jar on the classpath; the jar path here is a placeholder):
java -cp /xxx/tprofiler.jar com.taobao.profile.client.TProfilerClient 127.0.0.1 50000 flushmethod
This actually talks to a socket service exposed by the TProfiler agent, which reads and parses tprofiler.log and writes tmethod.log at the same time. The port is the one set in the configuration file.
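Mechanically, such a client is very simple. The sketch below shows the general idea (open a TCP connection to the agent's port and send a one-line command); it is not TProfiler's actual wire protocol, and the "ok" acknowledgement from the fake in-process agent is hypothetical.

```java
import java.io.*;
import java.net.*;

public class FlushClient {

    // Sketch of what a client like TProfilerClient does: connect to the
    // agent's configured port and send a one-line command such as "flushmethod".
    static String sendCommand(String host, int port, String cmd) {
        try (Socket s = new Socket(host, port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println(cmd);
            return in.readLine(); // whatever the agent answers, if anything
        } catch (IOException e) {
            return null;
        }
    }

    // In-process stand-in for the agent's socket service, for demo purposes:
    // accepts one connection, reads one command line, answers "ok".
    static int startFakeAgent() {
        try {
            ServerSocket server = new ServerSocket(0);
            new Thread(() -> {
                try (Socket c = server.accept();
                     BufferedReader r = new BufferedReader(
                             new InputStreamReader(c.getInputStream()));
                     PrintWriter w = new PrintWriter(c.getOutputStream(), true)) {
                    r.readLine();
                    w.println("ok");
                } catch (IOException ignored) {
                }
            }).start();
            return server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int port = startFakeAgent();
        System.out.println(sendCommand("127.0.0.1", port, "flushmethod"));
    }
}
```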
Open tmethod.log again:
It records information about each method:
The number at the start of each line is the method number, which can be used to look up that method's per-invocation timings in tprofiler.log;
The number at the end of the line is the line number of the last line of this method in the source code.
In fact, most performance analysis comes down to the average elapsed time of each method, so we also need to generate per-method averages from tprofiler.log and tmethod.log. The jar ships an analysis entry point for this (class name as in the TProfiler source; the jar path is a placeholder):
java -cp /xxx/tprofiler.jar com.taobao.profile.analysis.ProfilerLogAnalysis tprofiler.log tmethod.log topmethod.log topobject.log
When it finishes it prints:
print result success
Opening topmethod.log shows the average elapsed time of every method:
4 is the number of invocations;
205 is the average elapsed time (ms);
818 is the total elapsed time (ms).
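The figures above can be cross-checked with a one-line computation: the average is the total divided by the invocation count, rounded.

```java
public class AvgCheck {
    // Recompute the average from the (total, count) pair shown above.
    static long avgMillis(long totalMs, long count) {
        return Math.round((double) totalMs / count);
    }

    public static void main(String[] args) {
        // 818 / 4 = 204.5, which rounds to 205, matching the report
        System.out.println(avgMillis(818, 4));
    }
}
```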
It is consistent with the actual situation.
Detailed per-invocation timings
There may be other needs at this point; for example, what if I want to query all the per-invocation timings of one particular method?
The tool does not provide this out of the box, but it is doable, just a little cumbersome.
For example, to view the per-invocation timings of selectDB(), we first need to know the method's number, which can be found in tmethod.log:
The number is 2.
We know from before that tprofiler.log records the details, so we can view them with the following command:
grep 2 tprofiler.log
and pick out each execution by matching method number 2 in the third column.
But this approach is clearly not friendly enough: the output still has to be filtered by eye (a bare grep for 2 also matches other columns), and it takes several steps. So I plan to add a feature that takes just a method name and returns all the collected per-invocation timings for it.
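In the meantime, filtering on the third column specifically can be done mechanically. The sketch below does this in Java (the sample lines in the usage note are made up to match the four-column format); the same effect is available from the shell with awk '$3 == 2' tprofiler.log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MethodFilter {

    // Keep only records whose third column (the method number) matches,
    // instead of grepping for a bare "2" that also matches other columns.
    static List<String> filterByMethodId(List<String> lines, String methodId) {
        return lines.stream()
                .map(String::trim)
                .filter(l -> {
                    String[] cols = l.split("\\s+");
                    return cols.length >= 4 && cols[2].equals(methodId);
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical records: threadId depth methodId costMs
        List<String> sample = Arrays.asList("60 1 2 205", "60 2 3 120", "12 1 2 99");
        System.out.println(filterByMethodId(sample, "2"));
    }
}
```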
Going back to the original problem: analyzing the production system with this tool gave us the following findings:
Some methods were indeed sometimes fast and sometimes slow, and all of them involved the database. Given the current pressure on the database, we plan to separate hot and cold data and then shard databases and tables;
until that first measure is in place, we have changed part of the database writes to asynchronous to reduce response time;
we are considering adopting an APM tool such as Pinpoint.
There are actually many tools similar to TProfiler; just find the one that suits you.
Until we adopt a distributed tracing tool like Pinpoint, we will rely heavily on TProfiler, so we may do some customization later, such as adding a visual interface to improve troubleshooting efficiency.