
By the residual principle

If you are even slightly interested in computers, you have probably noticed that almost any modern processor reports twice as many “logical processors” as it has cores. And while cores are more or less clear – that is the number of separate physical computational units – what logical cores mean is not so obvious. In Intel processors, logical cores are the implementation of Hyper-Threading technology, and the company assures us that one of the advantages of the technology is “parallel work with several resource-intensive applications while maintaining the previous level of performance”. But where does this performance come from if the cores are not physical but logical? And why do vendors leave us the option of disabling this technology in the BIOS settings? Let’s try to figure it out.

Speeding up computation

Traditionally, performance-improvement techniques concentrated on making the cores themselves more powerful. There are three main approaches: raising the clock frequency of the cores, out-of-order execution of instructions, and increasing the size of the cache. Raising the processor frequency is the most obvious option: the processor can perform more operations per second, so the user gets the result faster. But in reality it is not that simple: processors already run at very high frequencies and often just sit waiting for new data to arrive from, say, RAM. To smooth this waiting out somehow, the processor may try to predict what that data will be and continue computing on the basis of the assumption, but then the price of a misprediction is high: the work has to be rolled back and redone.

There is another technique: instruction-level parallelism. Its essence is to use the processor’s ability to perform several operations at the same time. But programs are usually written to be executed sequentially, so we cannot simply run several instructions at once. The processor has to find the independent instructions within a certain “window” and dispatch them for execution as a group.
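As a rough sketch of the difference (the variables and values below are made up, and a real compiler would simply fold these constants away, so treat it only as an illustration of dependency structure):

    #include <stdio.h>

    int main(void) {
        int x = 1, y = 2, z = 3;

        /* Dependent chain: each line needs the previous result,
           so the processor has to execute them one after another. */
        int a = x + 1;
        int b = a * 2;
        int c = b - 3;

        /* Independent operations: an out-of-order core can notice that
           these do not depend on each other and issue them to different
           execution units in the same cycle. */
        int d = x + 1;
        int e = y * 2;
        int f = z - 3;

        printf("%d %d %d %d %d %d\n", a, b, c, d, e, f);
        return 0;
    }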

However, one of the main factors slowing the processor down is memory access. We can perform millions of operations per second, but if the data has to come from the disk, we will have to wait a few milliseconds, which is an incommensurably long time by processor standards. Yes, we can turn to RAM instead, loading the data into it in advance, but even that is slow from the processor’s point of view. To deal with this problem, the cache was invented: a very fast memory located directly inside the processor. All the data the processor works with passes through it. After processing, the data stays in the cache until it is displaced by newer data, so if the processor needs something it worked with recently, it may well avoid a trip to RAM. Cache is very expensive, and therefore it has a hierarchical structure. The L1 cache sits as close to the core as possible and is the fastest, but it is small: tens to a few hundred kilobytes. L2 is slower but larger, a few megabytes. Nowadays there is usually also an L3 cache, measured in tens of megabytes. For comparison, I am writing this text on a machine where the access latency to L1 is about 1 ns, to L2 about 3 ns, to L3 about 10 ns, and to RAM a whole 67 ns! As you can see, increasing the cache size is a very useful technique. But a cache miss is always possible – a situation where the needed data is not in the cache – and then an honest memory access is unavoidable, so the cache cannot be enlarged indefinitely; besides, it is expensive, it occupies the lion’s share of the die area and dissipates a huge amount of heat.
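The effect of cache-friendly access is easy to feel in a small C sketch (the matrix size is arbitrary and the actual timings depend entirely on the machine and the compiler flags): the same matrix is summed first row by row, walking memory sequentially, and then column by column, so that every access lands in a different cache line.

    #include <stdio.h>
    #include <time.h>

    #define N 4096

    int main(void) {
        static double m[N][N];            /* about 128 MB */
        double sum = 0.0;
        clock_t t;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 1.0;

        t = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];           /* row by row: sequential memory */
        printf("rows:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];           /* column by column: a jump of N*8 bytes each step */
        printf("columns: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        printf("sum = %.0f\n", sum);      /* keep the compiler from discarding the loops */
        return 0;
    }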

As you can see, none of these techniques gives 100% efficiency: the opportunities for instruction-level parallelism are limited, raising the processor frequency is negated by long memory accesses, and enlarging the cache is a difficult and expensive affair.

However, while one task is waiting for some event, we may as well work on another task. So, to run independent parts of programs in parallel, operating system architects came up with the concept of threads. A thread is an independent task to which processor time is periodically allocated. One or more threads make up a process. The process is the embodiment of a program: it holds the program’s instructions, information about the user who launched it, and other bookkeeping data. Threads can share resources and even exchange information to keep their actions consistent. While running, the processor simply switches between different threads from time to time and thus, almost simultaneously, executes different programs or even different parts of one program. This technique worked so well that operating systems themselves became multithreaded. For example, to get a YouTube video onto the screen the browser does not need 100% of the processor time, so it can periodically check whether a new e-mail has arrived in the next tab. Later, with the spread of multi-core processors, this technique revealed its potential even better, since the threads could run on different cores.
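A minimal sketch of what this looks like to a programmer (the workload is invented): two POSIX threads of one process sum independent halves of a range, and the operating system is free to run them on different logical cores at the same time.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread gets its own range; the ranges are independent. */
    static void *sum_range(void *arg) {
        long long *r = arg;               /* r[0]..r[1] is the range, r[2] receives the result */
        long long s = 0;
        for (long long i = r[0]; i < r[1]; i++)
            s += i;
        r[2] = s;
        return NULL;
    }

    int main(void) {
        long long a[3] = {0, 50000000, 0};
        long long b[3] = {50000000, 100000000, 0};
        pthread_t t1, t2;

        pthread_create(&t1, NULL, sum_range, a);   /* build with -pthread */
        pthread_create(&t2, NULL, sum_range, b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("total = %lld\n", a[2] + b[2]);
        return 0;
    }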

However, the multithreaded approach to writing programs brings not only advantages. One of the negative factors you may have noticed already: switching between threads. The point is that to perform this switch, all the state needed to resume the thread’s work later has to be saved. This operation takes time, albeit a little, so if only one processor core is available to the application, a single-threaded approach may turn out faster. But there are exceptions. For example, when one thread of a program cannot continue because it is waiting for some data while another thread is ready to run. In such cases the cost of switching between threads is more than repaid by hiding the wait for the data.

In addition, threads have to be synchronized and their use of shared resources supervised. For example, if two different threads write to the same file at the same time, we risk getting a useless jumble of data from both of them, so the writes have to be ordered. Similar problems arise when one thread is waiting for data from another, hung thread. Such problems greatly raise the bar for software developers.
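A mutex is the usual way to impose that ordering. A minimal POSIX sketch (the file name and the messages are invented for the example): each record is written in two parts, and without the lock the halves coming from the two threads could interleave.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
    static FILE *log_file;

    static void *writer(void *arg) {
        const char *name = arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&log_lock);
            fprintf(log_file, "%s: ", name);       /* two separate writes      */
            fprintf(log_file, "record %d\n", i);   /* form one logical record  */
            pthread_mutex_unlock(&log_lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        log_file = fopen("example.log", "w");
        if (!log_file) return 1;

        pthread_create(&t1, NULL, writer, (void *)"thread-A");
        pthread_create(&t2, NULL, writer, (void *)"thread-B");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        fclose(log_file);
        return 0;
    }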

Hyper-Threading

Hyper-Threading is positioned as a simultaneous multithreading technology. Let me remind you that it appeared at a time when multi-core processors were not yet common, and the ability to use one physical core as if it were two logical ones looked very impressive. The two logical cores share the core’s execution resources, but each has its own architectural state – exactly the state that would otherwise have to be saved and restored when switching threads. As a result, the operating system can schedule two threads at once, which promised a performance gain thanks to fuller use of the execution resources.
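On Linux you can see which logical cores are Hyper-Threading siblings of one physical core through sysfs. A small sketch (it assumes the usual sysfs layout and simply scans the first eight logical CPUs; on a machine with HT each list contains two entries, for example “0,2”, and without HT just one):

    #include <stdio.h>

    int main(void) {
        char path[128], line[64];
        for (int cpu = 0; cpu < 8; cpu++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                     cpu);
            FILE *f = fopen(path, "r");
            if (!f) break;                         /* no such logical CPU: stop */
            if (fgets(line, sizeof(line), f))
                printf("cpu%d shares a physical core with: %s", cpu, line);
            fclose(f);
        }
        return 0;
    }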

For a better understanding, let’s look at the processor a little more closely:

Here we see an honest dual-core processor. Each core has its architectural state – the part that stores the processor’s state during computation, mainly the registers. In addition, each core has a set of execution resources – the units that do the actual computation.

The execution resources include many different modules: for example, the ALU (Arithmetic Logic Unit), which handles simple arithmetic and logical transformations, the FPU (Floating-Point Unit), responsible for working with fractional numbers, and the ACU (Address Calculation Unit), needed for working with memory addresses. There can be (and usually are) many more execution units: for example, there are rather specialized SIMD modules that can process large volumes of data in a single instruction. What matters for us is that these execution resources are varied and numerous – for example, the new Ryzen Matisse chips (AMD’s 2019 processors) have 4 ALUs. It is obvious that while performing one large monotonous task, some of these resources will simply sit idle. For instance, when working with graphics we rarely need the ALU, since almost all the data is floating-point.

To solve this problem of idle resources, Intel proposed a different processor scheme:

Now several architectural states share common execution resources. As a result, while one thread of a video game is computing the frame geometry, another thread can process the artificial intelligence using the resources of the same core. But this works well only when the threads need different resources.

In its documentation, Intel gives a very clear example:

On the left is an ordinary dual-core processor. Its cores process the dark blue and light blue threads. The large number of white cells shows that the processor is idle a lot of the time.

On the right is a dual-core processor with Hyper-Threading. The first core processes the dark blue and light blue threads, and the second handles the other pair. As you can see, white cells (idle slots) are rare, the load is close to the peak, which means the resources are used more efficiently.

Advantages and disadvantages

As you can see, this approach to speeding the processor up is not perfect either. Summing up the above, we can list the pros and cons:
+ Higher utilization of the execution resources, minimizing their idle time.
+ Hiding the wait for events or data by switching to another thread.
– Threads stalling while the execution resources they need are busy with the other thread.
– The cache being shared among a larger number of logical cores.
– Higher demands on programmers, since to gain anything within a single program, that program has to have a multithreaded implementation.

Practical application

To check the advantages that Hyper-Threading provides, a Linux kernel build and runs of the Linpack and Prime95 benchmarks were carried out. The tests were run on an Intel Core i5-5200U processor (2 cores, 4 threads).
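To limit a test like this to particular logical cores, the threads can be pinned explicitly. Below is a rough sketch using the GNU extension pthread_setaffinity_np; the CPU numbers in it are placeholders, so check your own topology (for example, via thread_siblings_list shown earlier) to know which logical CPUs really share a physical core.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Busy work: a dependent chain of integer multiplications. */
    static void *spin(void *arg) {
        volatile uint64_t x = 1;
        for (uint64_t i = 0; i < 1000000000ULL; i++)
            x = x * 2862933555777941757ULL + 3037000493ULL;
        (void)arg;
        return NULL;
    }

    /* Start a thread and pin it to one logical CPU. */
    static pthread_t start_pinned(int cpu) {
        pthread_t t;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_create(&t, NULL, spin, NULL);
        pthread_setaffinity_np(t, sizeof(set), &set);
        return t;
    }

    int main(void) {
        /* Placeholders: use two HT siblings to measure the shared-core case,
           or two different physical cores to compare against it. */
        pthread_t a = start_pinned(0);
        pthread_t b = start_pinned(1);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("done");
        return 0;
    }

Timing the program (for example with the shell’s time command) for the two placements gives a feel for how much sibling threads compete for the core’s execution resources.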

Let’s start with the Linux kernel build. It was version 5.0.3 with the Arch distribution patches, compiled with GCC 9.2.0. Using only the 2 real cores, the build took 2 hours 24 minutes. With all 4 threads, the time dropped to 1 hour 41 minutes. But the most interesting scenario is a build with Hyper-Threading enabled and only 3 threads used: 1 hour 39 minutes. Apparently, the compilation algorithms use a limited set of execution units, and that set is shared poorly: the threads have to wait for one another, and the pollution of the cache becomes noticeable.

Next, Linpack Xtreme. The standard test (3 GB), 3 passes. It also brought a surprise: with 4 threads the average test time was 80.5 seconds, and with 2 threads 83.3 seconds. That is, in the case of linear algebra, the extra threads knocking data out of the cache turned out not to be so fatal.

The last one is the Prime95 benchmark, which is built around searching for prime numbers. The peculiarity of this test is that the search needs only a small set of operations, but those operations have to be performed a huge number of times. The results of the built-in benchmark at 8192K: with 2 threads (the 2 physical cores) an average of 34.176 ms per test, and with 4 threads 34.918 ms. This is a case where using HT is harmful, even though the drop in performance is small.

In conclusion, I would like to note that most modern techniques for speeding up computers rely in one way or another on splitting programs into parallel tasks. However, not all parts of a program can run in parallel: something will have to be computed sequentially. In such a situation, Amdahl’s law is suitable for estimating the possible speedup. Let’s look at the ratio of the time needed to execute the task sequentially on one processor core to the time of its partially parallel execution on N cores.
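In its common form, with the overhead of multithreading folded in as an additive term, this ratio (the speedup) can be written as:

    S(N) = T_1 / T_N = 1 / ((1 - P) + P/N + O)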

Here P is the share of the total volume of computation that can be perfectly parallelized, N is the number of cores used, and O is the overhead cost of using multithreading.

As you can see, the speedup depends heavily on how much of the work can be parallelized, and with the code that has to run sequentially we can do nothing at all. So the possibilities of multithreading are quite limited.

Nevertheless, an ordinary user should not worry too much, since their tasks are usually independent and parallelize very well. For example, video games, one of the heaviest workloads of an ordinary user, successfully use multithreading: while one part of the program computes the frame geometry, another can handle the artificial intelligence, a third processes the sound, a fourth preloads data from the disk, and so on. Here Hyper-Threading only plays into the user’s hands. When it comes to professional work, Hyper-Threading can be useless, harmful, or simply inapplicable, but those are the professionals’ problems: you are unlikely to be fine-tuning production real-time equipment that needs a guaranteed instant response, or some other highly specialized hardware busy with continuous identical computations, such as routers.

P.S. Thank you for your attention. I tried to describe the technology as simply as possible, without going into technical complications, but if you have questions or criticism, I will be glad to see them in the comments.
