Case analysis: a production PHP service incident caused by opcache

1. Background

After a release that started at 14:10 on 2021-05-13, a large number of "Call to undefined method" errors began to appear. Grafana and the logs showed the following:

  • The errors occurred on five servers
  • On each server, the errors lasted only about 3 seconds

2. Problem timeline

1. Launch timeline

  • 14:10:50 Release requested
  • 14:10:57 Data preparation completed
  • 14:10:57 Release started
  • 14:10:57 Synchronization of the directory files to the target servers via parallel-rsync started
  • 14:11:26 Synchronization to the first server completed
  • 14:11:41 Synchronization to the last server completed
  • 14:11:41 Release finished

2. Server 1 timeline

  • 14:11:35 Code synchronization to the server completed
  • 14:11:36 First error logged
  • 14:11:37 Last error logged (lasted 1 second, 220 errors in total)

3. Server 2 timeline

  • 14:11:35 Code synchronization to the server completed
  • 14:11:35 First error logged
  • 14:11:38 Last error logged (lasted 3 seconds, 237 errors in total)

4. Server 3 timeline

  • 14:11:35 Code synchronization to the server completed
  • 14:11:36 First error logged
  • 14:11:40 Last error logged (lasted 4 seconds, 700 errors in total)

5. Server 4 timeline

  • 14:11:39 Code synchronization to the server completed
  • 14:11:45 First error logged
  • 14:11:45 Last error logged (lasted 1 second, 196 errors in total)

6. Server 5 timeline

  • 14:11:28 Code synchronization to the server completed
  • 14:11:30 First error logged
  • 14:11:32 Last error logged (lasted 2 seconds, 399 errors in total)
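
To make the five timelines easier to compare, the gap between "synchronization completed" and "first error" can be computed per server. The following is only a small sketch: the times are copied from the timelines above, while the array layout and the helper name secondsBetween() are made up for illustration.

```php
<?php
// Times copied from the five server timelines above (all on 2021-05-13).
$timelines = [
    'Server 1' => ['synced' => '14:11:35', 'firstError' => '14:11:36'],
    'Server 2' => ['synced' => '14:11:35', 'firstError' => '14:11:35'],
    'Server 3' => ['synced' => '14:11:35', 'firstError' => '14:11:36'],
    'Server 4' => ['synced' => '14:11:39', 'firstError' => '14:11:45'],
    'Server 5' => ['synced' => '14:11:28', 'firstError' => '14:11:30'],
];

// Hypothetical helper: seconds from one HH:MM:SS time to another on the same day.
function secondsBetween(string $from, string $to): int
{
    return strtotime("2021-05-13 $to") - strtotime("2021-05-13 $from");
}

$delays = [];
foreach ($timelines as $server => $times) {
    $delays[$server] = secondsBetween($times['synced'], $times['firstError']);
    printf("%s: first error %d s after sync completed\n", $server, $delays[$server]);
}
```

None of the delays is negative: on every server, the errors started at or after the moment synchronization finished.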

3. Problem analysis

1. Complete error message

The error message says that a method that does not exist was called:

{
    "logtime":"2021-05-13 14:11:45",
    "Mode":"fpm-fcgi",
    "Msg":"Call to undefined method app\\xxx\\services\\xxx\\XXXService::doSomething()",
    "Trace":"\/home\/xxx\/xxxxxx\/xxxx.php(68)\n#0 {main}",
    "Uri":"\/xxx.php?xxxxxxxxxxxxxx",
    "Clientip":"xx.xx.xx.xx"
}

2. Questions and doubts

According to the logged call stack, the call was made in xxx.php. The Git commit history at that time shows that the doSomething() method did exist, so in principle it should have been impossible for the method not to be found.

3. Possible reasons

There are two possible causes: a deployment problem or an opcache problem.

(1) Deployment problem

While the release was synchronizing code to the target machine, the caller's file had already been synchronized, but the file containing the doSomething() method had not yet been synchronized.

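(2) opcache problem

Both files had already been synchronized to the target machine, but when the Zend engine loaded the code, the two files were distributed in opcache as follows:

[Figure: distribution of the two files in opcache]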

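4. Cause analysis

(1) Deployment problem

Based on the deployment script, uploading the project was estimated to take about 32 seconds. The deployment platform's log shows that 30 seconds elapsed between the start of the release and the completion of synchronization on all machines.

  • In the incident timelines, the errors on every server started only after the file upload had completed
  • The actual upload time of a little over 30 seconds basically matches the estimated 32 seconds

These two observations rule out a deployment problem: all code files had in fact been synchronized successfully.

(2) opcache problem

① opcache pseudocode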

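// Pseudocode: the helper names are illustrative, not real opcache functions
$now = time();
if (isOpcached($file)) {

    // The cached entry is still inside its validity window:
    // serve it without looking at the file on disk
    if ($now - lastUpdatedTime($file) < $revalidateFreq) {
        return getOpcache($file);
    }
}

// Not cached, or the validity window has expired: re-parse the PHP file
// (real opcache first compares the file's modification time and recompiles only if it changed)
$result = reParse($file);
writeOpcache($file, $result); // also records $now as the file's last update time
return $result;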

② How opcache executes

[Figure: opcache execution flow]

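③ The last update times of files in opcache are uneven

Because synchronizing the code files to a server takes about 30 seconds, the last update time recorded in opcache differs from file to file.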

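④ Conclusion

Because the last update time of each file in opcache differs, the following can happen:

  • When the Zend engine checks file A in opcache, it finds that the cached entry has expired, so it parses the new file A
  • File B was last updated in opcache very recently, so B's cached content is still within its validity period
  • The Zend engine therefore reads B's content straight from opcache, but that content is the old version

This scenario is possible in theory, but it has to be tested and reproduced. If it can be reproduced, that confirms the cause: the last update times of files A and B in opcache differ, so A is re-parsed while B is not.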
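
The scenario can be illustrated with a toy model of the pseudocode above. This is only a sketch and not the real opcache implementation: the class ToyOpcache, its method names, the file names and the timestamps are all hypothetical, and a 5-second validity period is assumed.

```php
<?php
// Toy model of the pseudocode above; NOT the real opcache implementation.
// Class, method and variable names are hypothetical.
final class ToyOpcache
{
    private $revalidateFreq;
    private $entries = []; // file => ['version' => string, 'updatedAt' => int]

    public function __construct(int $revalidateFreq)
    {
        $this->revalidateFreq = $revalidateFreq;
    }

    // Returns the version of $file that a request arriving at $now would run.
    public function load(string $file, string $versionOnDisk, int $now): string
    {
        if (isset($this->entries[$file])
            && $now - $this->entries[$file]['updatedAt'] < $this->revalidateFreq) {
            return $this->entries[$file]['version']; // still valid: disk is not read
        }
        // Not cached or expired: re-parse from disk and restart the window.
        $this->entries[$file] = ['version' => $versionOnDisk, 'updatedAt' => $now];
        return $versionOnDisk;
    }
}

$cache = new ToyOpcache(5); // assumed 5-second validity period

// Before the release, the two files entered the cache at different times.
$cache->load('A.php', 'old', 0);
$cache->load('B.php', 'old', 3);

// The release replaces both files on disk at t=4. A request arrives at t=6.
$a = $cache->load('A.php', 'new', 6); // 6 - 0 >= 5: expired, re-parsed
$b = $cache->load('B.php', 'new', 6); // 6 - 3 <  5: still valid, stale copy
echo "t=6: A=$a, B=$b\n"; // prints "t=6: A=new, B=old"

// By t=8, B's window has expired as well and the versions agree again.
$a2 = $cache->load('A.php', 'new', 8);
$b2 = $cache->load('B.php', 'new', 8);
echo "t=8: A=$a2, B=$b2\n"; // prints "t=8: A=new, B=new"
```

In this model the inconsistent window (new A, old B) only lasts until B's validity period runs out, which fits the observation that the errors on each server stopped after a few seconds.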

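4. Reproducing the PHP error caused by opcache

1. Test preparation

(1) opcache configuration

The opcache configuration used for the test has a validity period of 5 seconds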
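
One way to confirm that a test environment really uses a 5-second validity period is to print the relevant settings from PHP. The sketch below assumes that the validity period corresponds to opcache.revalidate_freq=5 with opcache.validate_timestamps enabled. These expected values are an assumption derived from the "5 seconds" in the text, not the author's original configuration, and the constant and function names are hypothetical.

```php
<?php
// Expected values are an ASSUMPTION based on the 5-second validity period
// mentioned in the text; they are not the author's original configuration.
const ASSUMED_OPCACHE_SETTINGS = [
    'opcache.enable' => '1',
    'opcache.validate_timestamps' => '1',
    'opcache.revalidate_freq' => '5',
];

// Hypothetical helper: formats one setting and compares it with the expected value.
// $actual is whatever ini_get() returned (a string, or false if the setting is unknown).
function describeSetting(string $name, string $expected, $actual): string
{
    if ($actual === false) {
        return "$name: not available (is the opcache extension loaded?)";
    }
    $status = ((string) $actual === $expected) ? 'ok' : "differs, expected $expected";
    return "$name = $actual ($status)";
}

foreach (ASSUMED_OPCACHE_SETTINGS as $name => $expected) {
    echo describeSetting($name, $expected, ini_get($name)), "\n";
}
```

The script should be served through the same SAPI as the test endpoint (php-fpm in this article), because the command-line PHP binary may load a different configuration.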

(2) Seven test scripts

  • online.sh: release script that simulates going online
cat TestController.php > /home/TestController.php
cat TestService.php > /home/TestService.php
  • rollback.sh: rollback script that simulates a rollback
cat TestControllerOld.php > /home/TestController.php
cat TestServiceOld.php > /home/TestService.php
  • TestControllerOld.php: the old Controller file, i.e. the Controller content after a rollback (calls getOpcacheStatus1)
<?php

class TestController {

    public function test(){
        phpinfo();
    }

    public function test2(int $number){

        // opcache_invalidate(__FILE__, true);
        $opcache = TestService::getInstance()->getOpcacheStatus1();
        $this->result = [
            'number' => $number,
            'opcache' => $opcache,
            'time' => date('Y-m-d H:i:s'),
        ];

    }
}
  • TestController.php: the new Controller file, i.e. the Controller content after a release (calls getOpcacheStatus2)
<?php

class TestController {

    public function test(){
        phpinfo();
    }

    public function test2(int $number){

        // opcache_invalidate(__FILE__, true);
        $opcache = TestService::getInstance()->getOpcacheStatus2();
        $this->result = [
            'number' => $number,
            'opcache' => $opcache,
            'time' => date('Y-m-d H:i:s'),
        ];

    }
}
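  • TestServiceOld.php: the old Service file, i.e. the Service content after a rollback (has no getOpcacheStatus2 method)
<?php

class TestService
{
    // getInstance() is called by TestController but is not shown in the original;
    // this minimal version is an assumption
    public static function getInstance(){
        return new static();
    }

    public function getOpcacheStatus1(){
        return 1;
    }

}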
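  • TestService.php: the new Service file, i.e. the Service content after a release (has the getOpcacheStatus2 method)
<?php

class TestService
{
    // getInstance() is called by TestController but is not shown in the original;
    // this minimal version is an assumption
    public static function getInstance(){
        return new static();
    }

    public function getOpcacheStatus1(){
        return 1;
    }

    public function getOpcacheStatus2(){
        sleep(1);
        return 2;
    }
}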
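  • loop.php: loop script that continuously alternates release and rollback
<?php

// Alternate release and rollback forever, one second apart
while (true) {
    sleep(1);
    echo "release\n";
    system('bash ./online.sh');
    sleep(1);
    echo "rollback\n";
    system('bash ./rollback.sh');
}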

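2. Test plan

(1) Test method

Releasing and rolling back continuously widens the differences between the last update times of different files in opcache. Under these high-probability conditions, and with sustained high-QPS traffic, the Call to undefined method error should be fairly easy to reproduce.

  • Run the loop script, which continuously alternates release and rollback
  • Start the Go script that sends concurrent requests to the endpoint at 100 QPS

With both running, watch the log with tail -f /home/log/sys_fatal.log

(2) Expected result

A Call to undefined method getOpcacheStatus2 error appears in the log

(3) Expected conclusion

If this error appears, the problem lies in PHP itself: within opcache, the Controller content has been updated but the Service content has not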

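3. Test process

Once all of the steps above were running, we began watching the log. The expected Call to undefined method errors soon appeared, and kept appearing, as shown in the figure below

[Figure: log output showing the Call to undefined method errors]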

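4. Test conclusions

  • The longer file synchronization takes during a release, the more the last update times of the files differ
  • The more the last update times differ, the more the validity periods of the corresponding opcache entries differ
  • The more the validity periods in opcache differ, the more likely a PHP process is to read inconsistent file versions (new content for file A, old content for file B)
  • The more likely a PHP process is to read inconsistent versions, the more the chance of an actual PHP error depends on QPS
  • The higher the QPS (100 or above), the more likely PHP errors become

5. Test summary

Even when all files are confirmed to be up to date, the longer synchronization takes, the more the last update times of the files differ, and the more likely opcache is to cause PHP errors after a release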

5. Summary

Because synchronizing files during the release took about 30 seconds, the last update times of the caller file and the callee file were bound to differ, and so were the last update times of the two corresponding entries in opcache. The logs from that time indicate that QPS was above 100, which was enough to trigger the opcache-related PHP errors.
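
As a rough sanity check on the traffic level, the error counts and durations from the five server timelines can be turned into average error rates. This is only a sketch: errors per second is at best a proxy for QPS (it assumes most requests to the affected endpoint failed during the window), the logged durations have one-second granularity, and the helper name errorsPerSecond() is hypothetical.

```php
<?php
// Error counts and durations copied from the five server timelines.
$errors = [
    'Server 1' => ['count' => 220, 'seconds' => 1],
    'Server 2' => ['count' => 237, 'seconds' => 3],
    'Server 3' => ['count' => 700, 'seconds' => 4],
    'Server 4' => ['count' => 196, 'seconds' => 1],
    'Server 5' => ['count' => 399, 'seconds' => 2],
];

// Hypothetical helper: average errors per second over the error window.
function errorsPerSecond(int $count, int $seconds): float
{
    return $count / $seconds;
}

foreach ($errors as $server => $e) {
    printf("%s: %.1f errors/s\n", $server, errorsPerSecond($e['count'], $e['seconds']));
}

$totalErrors = array_sum(array_column($errors, 'count'));
$totalSeconds = array_sum(array_column($errors, 'seconds'));
printf("Overall: %d errors in %d server-seconds\n", $totalErrors, $totalSeconds);
```

Four of the five servers average well above 100 errors per second during their error window; Server 2 averages 79.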

Origin juejin.im/post/7250375753598435383