"Slimming down" a player kernel: this is all you need!

Author | Quancheng, wireless development expert at Alibaba Entertainment

Editor | Tu Min

Overview

The Youku playback kernel is an SDK built on a pipeline architecture. Above, it supports Youku's rich and flexible business logic; below, it shields the differences between the various platforms. It is an outstanding player SDK: highly reliable, scalable, and cross-platform.

However, cross-team collaboration and years of iteration have also left the current playback kernel "bloated." Problems such as excessive memory usage and too many threads not only hurt the user experience but also limit what certain businesses can build, such as the multi-instance scheme needed for short video. The kernel modules therefore urgently needed a "lightweight" makeover, with the following goals:

1) fewer threads

2) a smaller memory footprint

3) lower power consumption

 

The situation before the transformation

The Youku playback kernel implements a pipeline-based framework, structured as follows:

It consists of an interface layer; an Engine that processes commands and reports messages; a filter layer that passes messages through transparently; a module layer containing the main working modules; and the rendering, data-download, and post-processing modules.
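
As a rough illustration only (the kernel's real interfaces are not public, and every name below is hypothetical), the layering can be pictured like this:

```cpp
#include <vector>

// Hypothetical sketch of the layering described above; the Youku kernel's
// real module names and interfaces are not public.
struct Module {
    virtual void onCommand(int cmd) = 0;   // commands flow downward
    virtual ~Module() = default;
};

class Engine {                             // interface-layer entry point
public:
    void sendCommand(int cmd) {            // e.g. prepare/start/seek/stop
        for (auto* m : modules_) m->onCommand(cmd);
    }
    void reportMessage(int what) { /* messages flow upward to the app */ }
private:
    // source, audio/video decoders, consumers, render, download, postprocess
    std::vector<Module*> modules_;
};
```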

After auditing and testing, we confirmed that our playback kernel uses far more threads than some open-source players (such as ijkplayer), and that metrics such as memory usage and per-video power consumption were also behind competing products. Our playback kernel needed a round of upgrades.

 

The transformation in detail

Our transformation targeted three areas: threads, memory, and power consumption. The goal was to complete the whole playback process with a minimum of threads, keep playback smooth with minimal memory, and make playback last longer by occupying minimal CPU resources.

The strategy adopted was to do "subtraction." Following the playback process, we kept the necessary threads, removed redundant ones, and reused the threads that could be shared. We then reviewed each retained thread to check whether its memory usage and CPU occupancy matched expectations, and investigated any anomaly one by one.

▐   Thread streamlining

Before optimization the kernel used nearly 30 threads, far more than other open-source players. Some were essential, some could be multiplexed with others, and some carried redundant logic and could be removed outright. In deciding which threads to keep, we considered the "minimum set" that one playback session needs, which should include threads for the following modules:

  • engine: receives interface commands and reports the kernel's messages;

  • source: reads data and drives the data flow downstream through the pipeline;

  • decoder: one each for audio and video; decodes the audio and video data;

  • consumer: one each for audio and video; handles synchronization and feeds rendering;

  • hal buffer: demultiplexes data and monitors buffer status;

  • ykstream: interacts with the parsing module and media segments, and controls the download module;

  • render: manages rendering.

As this shows, a playback session actually requires only nine threads. Of the others, preload management, quality monitoring, subtitle playback, and the like are enabled only when needed; the rest can be removed.

The streamlining steps were as follows:

1) Remove the redundant filter thread

The filter module was only used when creating the pipeline and afterwards merely passed messages through, which was somewhat redundant, so it could be removed outright. The module-creation logic was moved into the engine's prepare stage, a message channel was opened between the engine and each module, and commands going down and messages coming up no longer pass through the filter.
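
A minimal sketch of the idea, with hypothetical names (the article does not show the real channel API): each module reports straight to the engine through a callback, so no filter thread sits in between.

```cpp
#include <functional>

// Hypothetical sketch: after the filter thread is removed, each module owns
// a direct callback into the engine instead of routing through a filter.
class PipelineModule {
public:
    using MessageFn = std::function<void(int what, int arg)>;
    void setMessageChannel(MessageFn fn) { notify_ = std::move(fn); }
protected:
    void report(int what, int arg) { if (notify_) notify_(what, arg); }
private:
    MessageFn notify_;
};

// In Engine::prepare(), module creation and wiring now happen together:
//   module.setMessageChannel([this](int what, int arg) {
//       onModuleMessage(what, arg);   // reported without a filter hop
//   });
```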

2) Remove the message-transfer and clock-manager threads

Before optimization, the message-reporting channels were chaotic: some messages went straight to the Engine, while others were first reported to a message-transfer thread and then forwarded to the engine. That transfer layer was redundant, so its thread was removed; all messages are now reported through the engine.

The clock is used as a synchronization time manager and does not need its own thread; the thread existed purely as a timer. The one or two places in the kernel that relied on this timer were reworked to reuse other threads' logic, and once the timer dependency was gone, the thread could be removed.

3) Remove the interface-command thread and the message-reporting thread

The interface layer had an extra thread for relaying commands, built for the kernel's forcestop mechanism when an interface call timed out. After several rounds of optimization, forcestop triggers far less often, so this thread became largely redundant; even if a hang does occur, an ANR now appears in place of the original crash. The thread could be removed.

The message-reporting thread added one more hop between the kernel's messages and the instance that reports them; with some code reuse it is not essential and could also be removed.

4) Remove the demultiplexing thread and two cache threads

Data acquisition had always been the most bloated part of the kernel logic: before optimization, five threads implemented this function. Three of them could be optimized away, removing the demultiplexing thread and the two cache threads.

5) Remove the preload manager module and the subtitle decoder

The preload manager ran whether or not preloading was enabled, so a switch was added to control it; it now runs only when preloading is turned on.
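
In sketch form (all names hypothetical), the switch simply gates the manager's startup:

```cpp
#include <memory>

// Hypothetical sketch: spawn the preload manager's thread only when
// preloading is actually enabled, instead of unconditionally.
struct PreloadManager {
    void start() { /* spawn worker thread, warm the cache */ }
};

struct PlayerConfig { bool preloadEnabled = false; };

class PlayerEngine {
public:
    explicit PlayerEngine(PlayerConfig cfg) : config_(cfg) {}
    void prepare() {
        if (config_.preloadEnabled) {              // the added switch
            preload_ = std::make_unique<PreloadManager>();
            preload_->start();                     // thread starts on demand
        }
    }
private:
    PlayerConfig config_;
    std::unique_ptr<PreloadManager> preload_;
};
```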

Subtitle handling mainly consists of reading, parsing, and rendering the data. Unlike audio and video, the text can be parsed directly after it is read, so the subtitle decoder module could be removed.

After optimization, nine threads are required; adding playback-quality monitoring and subtitles, a total of 12 threads are retained. Video without subtitles uses only 10.

▐   Memory trimming

Memory is consumed mainly in four places: the buffer that caches downloaded data, the buffers inside the pipeline, the structures that store message (msg) information, and the objects of each class. The class objects are indispensable and leave little room for cutting, so the trimming was carried out from the other three angles: the cache buffer, the pipeline, and the information-storage structures.

1) Investigate places where memory usage did not meet expectations

Scanning per-thread memory data revealed that the read-buffer thread consumed far more memory than its configured value. Analyzing each ES sample showed that besides the payload, every packet also stored a copy of the codec context. The codec context is the same for every packet, so keeping a single copy is enough. Once the kernel fixed this unreasonable logic, memory usage dropped by nearly one third.
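
The fix amounts to sharing one context instead of copying it into every packet. A minimal sketch, with hypothetical structures (the kernel's real packet type is not public):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch of the fix: every ES packet used to carry its own
// copy of the codec context; since the context is identical for a stream,
// one shared copy is enough.
struct CodecContext { /* extradata, profile, sample rate, ... */ };

struct EsPacket {
    std::vector<uint8_t> data;                     // the sample payload
    std::shared_ptr<const CodecContext> codecCtx;  // shared, not duplicated
};

// All packets of one stream reference the same context:
//   auto ctx = std::make_shared<const CodecContext>();
//   EsPacket pkt{loadSample(), ctx};
```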

2) Reduce the cache buffer

The cache buffer was configured somewhat larger than competing products'. Considering that the download module also keeps a large buffer, the kernel buffer could be trimmed; after weighing it against the stall metrics, the buffer was set to a lower level.

3) Reduce pipeline memory usage

The kernel's pipeline plus the second-level cache consumed as much as 3.5 MB. By refactoring the source to use the second-level cache and optimizing the pipeline buffer pool, this memory was reduced to 0.5 MB.
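
As an illustration of the buffer-pool idea (hypothetical, not the kernel's actual implementation), pipeline stages recycle a fixed set of buffers instead of allocating per message, which bounds total memory:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical buffer-pool sketch: stages borrow and return fixed-size
// buffers, so pipeline memory stays bounded instead of growing per message.
class BufferPool {
public:
    BufferPool(size_t count, size_t bytes) {
        for (size_t i = 0; i < count; ++i)
            free_.emplace_back(std::make_unique<std::vector<uint8_t>>(bytes));
    }
    std::unique_ptr<std::vector<uint8_t>> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;   // caller waits or drops the frame
        auto buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }
    void release(std::unique_ptr<std::vector<uint8_t>> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.emplace_back(std::move(buf));
    }
private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<std::vector<uint8_t>>> free_;
};
```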

4) Optimize the data structures

Configuration information, for example, was stored in AMessage, and each AMessage consumes 4 KB. In the HLS smart-quality scenario, an AMessage was created for every record, so the records alone added up to more than 6 MB, not counting other uses of AMessage. We therefore wrote a structure with similar functionality to replace it, keeping the AMessage-style interface while cutting the unnecessary internal memory overhead.
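
A minimal sketch of such a replacement (hypothetical; the real substitute and AMessage's full interface are richer): keep the familiar setter/finder style, but let each entry occupy only what it needs rather than a fixed ~4 KB per instance.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>

// Hypothetical sketch: an AMessage-like container whose footprint grows
// with its contents instead of being fixed per instance.
class LiteMessage {
public:
    void setInt64(const std::string& key, int64_t v) { items_[key] = v; }
    void setString(const std::string& key, std::string v) {
        items_[key] = std::move(v);
    }
    bool findInt64(const std::string& key, int64_t* out) const {
        auto it = items_.find(key);
        if (it == items_.end() || !std::holds_alternative<int64_t>(it->second))
            return false;
        *out = std::get<int64_t>(it->second);
        return true;
    }
private:
    std::map<std::string, std::variant<int64_t, double, std::string>> items_;
};
```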

After optimization, the player kernel's peak memory dropped to one third of the original, greatly reducing the memory cost of a single instance.

▐   Power consumption optimization

The main factors affecting power consumption are CPU occupancy, the duration of network requests, and the power draw of the screen and audio device. Factors such as screen brightness and volume are fixed, so reducing power consumption mainly comes down to two aspects: CPU utilization and the duration of network requests.

1) Remove unnecessary procedures and cut excess threads

This part was already completed during the thread streamlining and is not repeated here.

2) Control the timing of network requests to avoid long-running network activity

When a mobile device makes network requests, the Wi-Fi/4G radio powers up immediately, and this accounts for a large share of power consumption. It is therefore better to read a large chunk of data and then wait than to request small pieces frequently. Weighing factors such as stalling, the kernel's default is to restart the download only after the cache has drained below two thirds.
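
A sketch of this gating policy, assuming a hypothetical cache interface (the two-thirds threshold comes from the article; everything else is illustrative):

```cpp
#include <cstddef>

// Hypothetical sketch of burst downloading: let the radio idle until the
// cache drains below the restart threshold, then refill in one large burst.
class DownloadGate {
public:
    explicit DownloadGate(size_t capacityBytes)
        : capacity_(capacityBytes),
          restartThreshold_(capacityBytes * 2 / 3) {}  // article's default

    // Hysteresis: once started, fill to the top; once full, stay idle
    // until the cache falls below the threshold again.
    bool shouldDownload(size_t cachedBytes) {
        if (downloading_ && cachedBytes >= capacity_)
            downloading_ = false;
        else if (!downloading_ && cachedBytes < restartThreshold_)
            downloading_ = true;
        return downloading_;
    }
private:
    size_t capacity_, restartThreshold_;
    bool downloading_ = false;
};
```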

3) Replace the data-storage structure and remove redundant access logic

Investigation found that the CPU became abnormally busy every time data was written to the buffer, which did not match expectations. A code review located the outlier: the data was stored in a vector, and each element was pushed to the front; once the vector grew to the order of tens of thousands of elements, this push-to-front operation consumed a great deal of CPU. The fix was to change the vector to a list, writing data at the tail and reading from the head, after which the problem no longer reproduced.
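
The essence of the fix, sketched with standard containers (the surrounding kernel code is hypothetical): inserting at the front of a std::vector shifts every existing element, so each write was O(n); a std::list (or std::deque) makes both ends O(1).

```cpp
#include <cstdint>
#include <list>
#include <vector>

// Sketch of the fix: write samples at the tail, read from the head.
// With a std::vector and front insertion, each write shifted tens of
// thousands of elements; a std::list does both operations in O(1).
std::list<std::vector<uint8_t>> g_buffer;

void writeSample(std::vector<uint8_t> sample) {
    g_buffer.push_back(std::move(sample));   // O(1) append at the tail
}

bool readSample(std::vector<uint8_t>* out) {
    if (g_buffer.empty()) return false;
    *out = std::move(g_buffer.front());      // O(1) removal at the head
    g_buffer.pop_front();
    return true;
}
```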

4) Switch OMX calls from synchronous to asynchronous to cut decoding CPU time

On Android, the hardware-decoding OMX module uses a synchronous calling mode by default. Below Android 9.0 the native layer offers only this mode, which loops over queue/dequeue operations and wastes CPU. On Android 9.0 and above, the native layer also provides an asynchronous OMX calling mode: the decoding module performs queue/dequeue only in callbacks after work completes, so it consumes less CPU than the synchronous mode. In our traces, the asynchronous calls were clearly sparser than the synchronous ones.
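
The public NDK counterpart of this mode is AMediaCodec's asynchronous callback API, available since API 28 (Android 9.0). A condensed sketch (error handling omitted; the kernel's actual integration is not shown in the article):

```cpp
#include <media/NdkMediaCodec.h>

// Callbacks fire only when the codec has work ready, so there is no
// busy polling of dequeue calls as in the synchronous mode.
static void onInput(AMediaCodec* codec, void* userdata, int32_t index) {
    // fill input buffer `index` with the next encoded sample, then
    // AMediaCodec_queueInputBuffer(...)
}
static void onOutput(AMediaCodec* codec, void* userdata,
                     int32_t index, AMediaCodecBufferInfo* info) {
    // consume the decoded frame, then hand the buffer back
    AMediaCodec_releaseOutputBuffer(codec, index, /*render=*/true);
}
static void onFormat(AMediaCodec*, void*, AMediaFormat*) {}
static void onError(AMediaCodec*, void*, media_status_t, int32_t, const char*) {}

void enableAsyncDecode(AMediaCodec* codec) {
    AMediaCodecOnAsyncNotifyCallback cb{onInput, onOutput, onFormat, onError};
    AMediaCodec_setAsyncNotifyCallback(codec, cb, /*userdata=*/nullptr);
    // configure and AMediaCodec_start(codec) afterwards as usual
}
```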

5) Fix the speed-change algorithm to remove redundant computation

A review found that the audioconsumer thread consumed far more CPU than the audio decoder, which did not match expectations. Inspection showed that even with variable-speed playback turned off, the code still went through the speed-change arithmetic, causing abnormal CPU consumption. A before-and-after comparison confirmed the fix.
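
In sketch form (all names hypothetical): skip the time-stretch path entirely at 1x speed.

```cpp
// Hypothetical sketch of the fix: run the rate-conversion DSP only when
// the playback rate actually differs from 1x.
struct AudioFrame { /* pcm samples, timestamps, ... */ };

class AudioConsumer {
public:
    void process(AudioFrame& frame) {
        if (playbackRate_ == 1.0f) {   // fast path added by the fix
            render(frame);             // no speed algorithm involved
            return;
        }
        applyRateChange(frame);        // only pay the DSP cost when needed
        render(frame);
    }
private:
    void render(const AudioFrame&) {}
    void applyRateChange(AudioFrame&) { /* sonic/soundtouch-style DSP */ }
    float playbackRate_ = 1.0f;
};
```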

6) Implement danmaku logic in the kernel layer

Danmaku (bullet comments) was originally implemented with views at the application layer. Under heavy danmaku traffic this hurt power consumption badly, and the danmaku could even turn blurry. We therefore moved the implementation into the kernel layer: the kernel receives the danmaku data and renders it itself. After verification, danmaku power consumption dropped by two thirds.

After optimization, the average CPU occupancy during playback is below 7% (tested on a mid-range Android device), and playing a 90-minute 1080p video drains about 12% of the battery, roughly a 30% improvement over the kernel before optimization.

Summary

At this point, the playback kernel has slimmed down substantially compared with before the optimization. The leaner kernel's code logic is clearer and its data transfer more concise and efficient, letting the engineers who work on the kernel focus more on their own business. Memory usage dropped markedly: judged by memory alone, where the old kernel could support two instances, six can now be created, greatly widening the upper bound of the business logic. And the lower power consumption greatly improves the user's playback experience.

Note that our business is complex and many teams take part in development; after a few release iterations, the kernel will inevitably grow bloated again. So we monitor memory, power consumption, and other dimensions of every official release, and fix problems the moment they are found so they do not accumulate. The kernel also undergoes regular small-scale refactoring, removing unreasonable code and unifying common logic into shared processing units, to keep it maintainable and high-quality.
