<h1>The mysterious “sk_run_filter”</h1>
<p>2022-10-05 · MySQL Performance Blog</p>
<h1 id="tldr">TL;DR</h1>
<p>Details:</p>
<ul>
<li><strong>sk_run_filter</strong> is part of Linux secure computing (<strong>seccomp</strong>); it filters system calls and their parameters</li>
<li><strong>seccomp</strong> is invoked on every system call made by every program inside a docker container, including the MySQL server</li>
<li><strong>seccomp</strong> is not JIT-compiled in Linux kernel 3.10, so frequent system calls lead to high CPU consumption in the <strong>sk_run_filter</strong> function</li>
<li><strong>seccomp</strong> is JIT-compiled in Linux kernel 4.19</li>
<li><strong>seccomp</strong> is further optimized in kernel 5.15 with a bitmap cache</li>
<li>Docker engine uses <strong>seccomp</strong>: it generates a BPF program which is loaded via prctl (PR_SET_SECCOMP) in runc (libcontainer/seccomp/patchbpf/enosys_linux.go)</li>
<li>The list of allowed system calls is described in a JSON file, usually as a whitelist of 300+ entries, which generates a rather long BPF program</li>
<li>Inside the docker container, our MySQL performance drops by more than 40%</li>
</ul>
<p>Recommendations:</p>
<ul>
<li>Update to Linux kernel 5.15, or at least 4.19, if you use docker containers in production</li>
<li>Mitigation for older Linux kernels: rewrite docker’s JSON profile from the whitelist approach (300+ entries) to a blacklist (about 40 entries in our case). It is faster to block a small number of system calls than to allow a wide range, as the default docker filter does</li>
<li>A more complex mitigation for older kernels: use an up-to-date <strong>libseccomp</strong> (version 2.5.4) with the binary tree optimization feature. There are the following options:
<ul>
<li>Recompile docker with the latest <strong>libseccomp</strong> and binary tree optimization patch</li>
<li>Or turn off docker’s current seccomp security feature entirely, load the <strong>seccomp</strong> BPF program manually in the first binary executed inside the container, and fork the other processes after that</li>
</ul>
</li>
</ul>
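<p>To make the blacklist recommendation concrete, a custom profile can be passed to docker via <code>--security-opt</code>. The sketch below is illustrative only: the syscall list is a small placeholder, not our ~40-entry production list, and the file name and image tag are arbitrary; the keys follow docker’s documented seccomp profile format.</p>

```shell
# Hypothetical blacklist profile: allow everything by default,
# return an error for the few syscalls we want to block.
cat > seccomp-blacklist.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["add_key", "keyctl", "ptrace", "request_key", "umount2"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
EOF

# Start the container with the custom profile instead of docker's default
docker run --security-opt seccomp=seccomp-blacklist.json mysql:8.0
```

<p>The resulting BPF program has only a handful of comparisons on the hot path instead of 300+, which is exactly what the interpreter in kernel 3.10 needs.</p>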
<p>Linux kernels 3.10, 4.19 and 5.15 were chosen because they have long-term support <a href="https://kernel.org/">from the community</a> and in Huawei’s <a href="https://support.huawei.com/enterprise/ru/software/250798008-ESW2000173842">EulerOS</a>.</p>
<h1 id="problem">Problem</h1>
<p>A MySQL server is deployed inside a docker container; the server runs Linux kernel 3.10. Under high load, the kernel function <strong>sk_run_filter</strong> consumes an extraordinarily high amount of CPU:</p>
<p><img src="/assets/images/skrunfilter/problem-perf-report.png" alt="problem" /></p>
<p>So, that’s where the story starts!</p>
<h1 id="the-original-bpf">The original BPF</h1>
<p><strong>sk_run_filter</strong> executes a BPF program in kernel space. The program targets a rather simple virtual machine inside the kernel (one accumulator register, a very restricted instruction set). Its main original purpose is to run user-defined hooks on network data. The program instructions are stored in an array and attached via the setsockopt (SO_ATTACH_FILTER) system call, for example:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="nc">sock_filter</span> <span class="n">bpfcode</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span> <span class="n">OP_LDH</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">12</span> <span class="p">},</span> <span class="c1">// ldh [12]</span>
<span class="p">{</span> <span class="n">OP_JEQ</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">ETH_P_IP</span> <span class="p">},</span> <span class="c1">// jeq #0x800, L2, L5</span>
<span class="p">{</span> <span class="n">OP_LDB</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">23</span> <span class="p">},</span> <span class="c1">// ldb [23]</span>
<span class="p">{</span> <span class="n">OP_JEQ</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">IPPROTO_TCP</span> <span class="p">},</span> <span class="c1">// jeq #0x6, L4, L5</span>
<span class="p">{</span> <span class="n">OP_RET</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span> <span class="p">},</span> <span class="c1">// ret #0x0</span>
<span class="p">{</span> <span class="n">OP_RET</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">},</span> <span class="c1">// ret #0xffffffff</span>
<span class="p">};</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// ....</span>
<span class="k">if</span> <span class="p">(</span><span class="n">setsockopt</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="n">SOL_SOCKET</span><span class="p">,</span> <span class="n">SO_ATTACH_FILTER</span><span class="p">,</span> <span class="o">&</span><span class="n">bpf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bpf</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">perror</span><span class="p">(</span><span class="s">"setsockopt ATTACH_FILTER"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// ....</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Let’s look inside the kernel source code, version 3.10.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* sk_run_filter - run a filter on a socket
* @skb: buffer to run the filter on
* @fentry: filter to apply
*
* Decode and apply filter instructions to the skb->data.
* Return length to keep, 0 for none. @skb is the data we are
* filtering, @filter is the array of filter instructions.
* Because all jumps are guaranteed to be before last instruction,
* and last instruction guaranteed to be a RET, we dont need to check
* flen. (We used to pass to this function the length of filter)
*/</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="nf">sk_run_filter</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
<span class="k">const</span> <span class="k">struct</span> <span class="nc">sock_filter</span> <span class="o">*</span><span class="n">fentry</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">;</span>
<span class="n">u32</span> <span class="n">A</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* Accumulator */</span>
<span class="n">u32</span> <span class="n">X</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="cm">/* Index Register */</span>
<span class="n">u32</span> <span class="n">mem</span><span class="p">[</span><span class="n">BPF_MEMWORDS</span><span class="p">];</span> <span class="cm">/* Scratch Memory Store */</span>
<span class="n">u32</span> <span class="n">tmp</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">k</span><span class="p">;</span>
<span class="cm">/*
* Process array of filter instructions.
*/</span>
<span class="k">for</span> <span class="p">(;;</span> <span class="n">fentry</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="cp">#if defined(CONFIG_X86_32)
#define K (fentry->k)
#else
</span> <span class="k">const</span> <span class="n">u32</span> <span class="n">K</span> <span class="o">=</span> <span class="n">fentry</span><span class="o">-></span><span class="n">k</span><span class="p">;</span>
<span class="cp">#endif
</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">fentry</span><span class="o">-></span><span class="n">code</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">BPF_S_ALU_ADD_X</span><span class="p">:</span>
<span class="n">A</span> <span class="o">+=</span> <span class="n">X</span><span class="p">;</span>
<span class="k">continue</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_S_ALU_ADD_K</span><span class="p">:</span>
<span class="n">A</span> <span class="o">+=</span> <span class="n">K</span><span class="p">;</span>
<span class="k">continue</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_S_ALU_SUB_X</span><span class="p">:</span>
<span class="n">A</span> <span class="o">-=</span> <span class="n">X</span><span class="p">;</span>
<span class="k">continue</span><span class="p">;</span>
<span class="cm">/*
.....
a lot of case statements
.....
*/</span>
<span class="cp">#ifdef CONFIG_SECCOMP_FILTER
</span> <span class="k">case</span> <span class="n">BPF_S_ANC_SECCOMP_LD_W</span><span class="p">:</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">seccomp_bpf_load</span><span class="p">(</span><span class="n">fentry</span><span class="o">-></span><span class="n">k</span><span class="p">);</span>
<span class="k">continue</span><span class="p">;</span>
<span class="cp">#endif
</span> <span class="nl">default:</span>
<span class="n">WARN_RATELIMIT</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">"Unknown code:%u jt:%u tf:%u k:%u</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">fentry</span><span class="o">-></span><span class="n">code</span><span class="p">,</span> <span class="n">fentry</span><span class="o">-></span><span class="n">jt</span><span class="p">,</span>
<span class="n">fentry</span><span class="o">-></span><span class="n">jf</span><span class="p">,</span> <span class="n">fentry</span><span class="o">-></span><span class="n">k</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As we can see, it’s a classic <strong><em>interpreter</em></strong> executing the BPF program, which is very slow. To solve the performance issues for sockets (the network subsystem), a JIT compiler was optionally added for some platforms; for x86_64, for example, we have one:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef CONFIG_BPF_JIT
</span>
<span class="cm">/* ...... */</span>
<span class="cp">#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
#else
</span><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">bpf_jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="nc">sk_filter</span> <span class="o">*</span><span class="n">fp</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">bpf_jit_free</span><span class="p">(</span><span class="k">struct</span> <span class="nc">sk_filter</span> <span class="o">*</span><span class="n">fp</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">}</span>
<span class="cp">#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
#endif
</span>
<span class="cm">/* ...... */</span>
<span class="kt">void</span> <span class="nf">bpf_jit_compile</span><span class="p">(</span><span class="k">struct</span> <span class="nc">sk_filter</span> <span class="o">*</span><span class="n">fp</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* ...... */</span>
<span class="cm">/* JITed image shrinks with every pass and the loop iterates
* until the image stops shrinking. Very large bpf programs
* may converge on the last pass. In such case do one more
* pass to emit the final image
*/</span>
<span class="k">for</span> <span class="p">(</span><span class="n">pass</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">pass</span> <span class="o"><</span> <span class="mi">10</span> <span class="o">||</span> <span class="n">image</span><span class="p">;</span> <span class="n">pass</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">u8</span> <span class="n">seen_or_pass0</span> <span class="o">=</span> <span class="p">(</span><span class="n">pass</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="p">(</span><span class="n">SEEN_XREG</span> <span class="o">|</span> <span class="n">SEEN_DATAREF</span> <span class="o">|</span> <span class="n">SEEN_MEM</span><span class="p">)</span> <span class="o">:</span> <span class="n">seen</span><span class="p">;</span>
<span class="cm">/* no prologue/epilogue for trivial filters (RET something) */</span>
<span class="n">proglen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">prog</span> <span class="o">=</span> <span class="n">temp</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">seen_or_pass0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">EMIT4</span><span class="p">(</span><span class="mh">0x55</span><span class="p">,</span> <span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x89</span><span class="p">,</span> <span class="mh">0xe5</span><span class="p">);</span> <span class="cm">/* push %rbp; mov %rsp,%rbp */</span>
<span class="n">EMIT4</span><span class="p">(</span><span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x83</span><span class="p">,</span> <span class="mh">0xec</span><span class="p">,</span> <span class="mi">96</span><span class="p">);</span> <span class="cm">/* subq $96,%rsp */</span>
<span class="cm">/* note : must save %rbx in case bpf_error is hit */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">seen_or_pass0</span> <span class="o">&</span> <span class="p">(</span><span class="n">SEEN_XREG</span> <span class="o">|</span> <span class="n">SEEN_DATAREF</span><span class="p">))</span>
<span class="n">EMIT4</span><span class="p">(</span><span class="mh">0x48</span><span class="p">,</span> <span class="mh">0x89</span><span class="p">,</span> <span class="mh">0x5d</span><span class="p">,</span> <span class="mh">0xf8</span><span class="p">);</span> <span class="cm">/* mov %rbx, -8(%rbp) */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">seen_or_pass0</span> <span class="o">&</span> <span class="n">SEEN_XREG</span><span class="p">)</span>
<span class="n">CLEAR_X</span><span class="p">();</span> <span class="cm">/* make sure we dont leek kernel memory */</span>
<span class="cm">/* ...... */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">bpf_jit_enable</span> <span class="o">></span> <span class="mi">1</span><span class="p">)</span>
<span class="n">bpf_jit_dump</span><span class="p">(</span><span class="n">flen</span><span class="p">,</span> <span class="n">proglen</span><span class="p">,</span> <span class="n">pass</span><span class="p">,</span> <span class="n">image</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">image</span><span class="p">)</span> <span class="p">{</span>
<span class="n">bpf_flush_icache</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">image</span> <span class="o">+</span> <span class="n">proglen</span><span class="p">);</span>
<span class="n">fp</span><span class="o">-></span><span class="n">bpf_func</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">image</span><span class="p">;</span>
<span class="p">}</span>
<span class="nl">out:</span>
<span class="n">kfree</span><span class="p">(</span><span class="n">addrs</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The typical usage for <strong>SK_RUN_FILTER</strong> is the following:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* sk_filter_trim_cap - run a packet through a socket filter
* @sk: sock associated with &sk_buff
* @skb: buffer to filter
* @cap: limit on how short the eBPF program may trim the packet
*
* Run the filter code and then cut skb->data to correct size returned by
* sk_run_filter. If pkt_len is 0 we toss packet. If skb->len is smaller
* than pkt_len we keep whole skb->data. This is the socket level
* wrapper to sk_run_filter. It returns 0 if the packet should
* be accepted or -EPERM if the packet should be tossed.
*
*/</span>
<span class="kt">int</span> <span class="nf">sk_filter_trim_cap</span><span class="p">(</span><span class="k">struct</span> <span class="nc">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">cap</span><span class="p">)</span>
<span class="p">{</span>
<span class="cm">/* ..... */</span>
<span class="n">rcu_read_lock</span><span class="p">();</span>
<span class="n">filter</span> <span class="o">=</span> <span class="n">rcu_dereference</span><span class="p">(</span><span class="n">sk</span><span class="o">-></span><span class="n">sk_filter</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">filter</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">pkt_len</span> <span class="o">=</span> <span class="n">SK_RUN_FILTER</span><span class="p">(</span><span class="n">filter</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span> <span class="c1">//<<-- HERE</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">pkt_len</span> <span class="o">?</span> <span class="n">pskb_trim</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">max</span><span class="p">(</span><span class="n">cap</span><span class="p">,</span> <span class="n">pkt_len</span><span class="p">))</span> <span class="o">:</span> <span class="o">-</span><span class="n">EPERM</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">rcu_read_unlock</span><span class="p">();</span>
<span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The code snippets above show that on the x86_64 platform the BPF program is JIT-compiled to native x86_64 machine code and <strong>bpf_func</strong> is invoked, so <strong>sk_run_filter</strong> should not appear in the perf profile at all. That means we are observing something different in our testing environment.</p>
<h1 id="seccomp-bpf">Seccomp BPF</h1>
<p>Secure computing mode is one of the security features of the Linux kernel; it provides the ability to filter system calls and their arguments using the BPF mechanism. The BPF program is loaded by the <strong>prctl (PR_SET_SECCOMP)</strong> or <strong>seccomp (SECCOMP_SET_MODE_FILTER)</strong> system call. A simple example of how this works with <strong>prctl</strong> is shown <a href="https://gist.github.com/fntlnz/08ae20befb91befd9a53cd91cdc6d507">here</a>, where the author forbids the <em>write</em> system call by doing the following:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <errno.h>
#include <linux/audit.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <linux/unistd.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>
</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">install_filter</span><span class="p">(</span><span class="kt">int</span> <span class="n">nr</span><span class="p">,</span> <span class="kt">int</span> <span class="n">arch</span><span class="p">,</span> <span class="kt">int</span> <span class="n">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="n">filter</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">BPF_STMT</span><span class="p">(</span><span class="n">BPF_LD</span> <span class="o">+</span> <span class="n">BPF_W</span> <span class="o">+</span> <span class="n">BPF_ABS</span><span class="p">,</span> <span class="p">(</span><span class="n">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="nc">seccomp_data</span><span class="p">,</span> <span class="n">arch</span><span class="p">))),</span>
<span class="n">BPF_JUMP</span><span class="p">(</span><span class="n">BPF_JMP</span> <span class="o">+</span> <span class="n">BPF_JEQ</span> <span class="o">+</span> <span class="n">BPF_K</span><span class="p">,</span> <span class="n">arch</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="n">BPF_STMT</span><span class="p">(</span><span class="n">BPF_LD</span> <span class="o">+</span> <span class="n">BPF_W</span> <span class="o">+</span> <span class="n">BPF_ABS</span><span class="p">,</span> <span class="p">(</span><span class="n">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="nc">seccomp_data</span><span class="p">,</span> <span class="n">nr</span><span class="p">))),</span>
<span class="n">BPF_JUMP</span><span class="p">(</span><span class="n">BPF_JMP</span> <span class="o">+</span> <span class="n">BPF_JEQ</span> <span class="o">+</span> <span class="n">BPF_K</span><span class="p">,</span> <span class="n">nr</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">BPF_STMT</span><span class="p">(</span><span class="n">BPF_RET</span> <span class="o">+</span> <span class="n">BPF_K</span><span class="p">,</span> <span class="n">SECCOMP_RET_ERRNO</span> <span class="o">|</span> <span class="p">(</span><span class="n">error</span> <span class="o">&</span> <span class="n">SECCOMP_RET_DATA</span><span class="p">)),</span>
<span class="n">BPF_STMT</span><span class="p">(</span><span class="n">BPF_RET</span> <span class="o">+</span> <span class="n">BPF_K</span><span class="p">,</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">),</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="nc">sock_fprog</span> <span class="n">prog</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">len</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">short</span><span class="p">)(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">filter</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">filter</span><span class="p">[</span><span class="mi">0</span><span class="p">])),</span>
<span class="p">.</span><span class="n">filter</span> <span class="o">=</span> <span class="n">filter</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prctl</span><span class="p">(</span><span class="n">PR_SET_NO_NEW_PRIVS</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span> <span class="p">{</span>
<span class="n">perror</span><span class="p">(</span><span class="s">"prctl(NO_NEW_PRIVS)"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prctl</span><span class="p">(</span><span class="n">PR_SET_SECCOMP</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="o">&</span><span class="n">prog</span><span class="p">))</span> <span class="p">{</span>
<span class="n">perror</span><span class="p">(</span><span class="s">"prctl(PR_SET_SECCOMP)"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"hey there!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="n">install_filter</span><span class="p">(</span><span class="n">__NR_write</span><span class="p">,</span> <span class="n">AUDIT_ARCH_X86_64</span><span class="p">,</span> <span class="n">EPERM</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"something's gonna happen!!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"it will not definitely print this here</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Looking into the <em>perf</em> profile (gathered with the children mode), we can see the following stack trace:</p>
<p><img src="/assets/images/skrunfilter/perf-report-with-children.png" alt="perf-report-with-children" /></p>
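<p>For reference, a profile like this can be reproduced with stock <em>perf</em>; the commands below are a sketch (the PID discovery via <code>pidof</code> and the 30-second sampling window are arbitrary choices):</p>

```shell
# Record on-CPU call stacks of the running mysqld for 30 seconds
perf record -g -p "$(pidof mysqld)" -- sleep 30

# Display the profile with the "children" (callee-inclusive) column
perf report --children
```

<p>The children column attributes a function’s cost to all of its callers, which is what makes the seccomp call chain below visible.</p>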
<p>Let’s look for these symbols inside the <strong>Linux kernel 3.10</strong> source code:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* seccomp_run_filters - evaluates all seccomp filters against @syscall
* @syscall: number of the current system call
*
* Returns valid seccomp BPF response codes.
*/</span>
<span class="k">static</span> <span class="n">u32</span> <span class="nf">seccomp_run_filters</span><span class="p">(</span><span class="kt">int</span> <span class="n">syscall</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">f</span><span class="p">;</span>
<span class="n">u32</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">;</span>
<span class="cm">/* Ensure unexpected behavior doesn't result in failing open. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON</span><span class="p">(</span><span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">))</span>
<span class="k">return</span> <span class="n">SECCOMP_RET_KILL</span><span class="p">;</span>
<span class="cm">/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/</span>
<span class="k">for</span> <span class="p">(</span><span class="n">f</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">;</span> <span class="n">f</span><span class="p">;</span> <span class="n">f</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">u32</span> <span class="n">cur_ret</span> <span class="o">=</span> <span class="n">sk_run_filter</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="n">f</span><span class="o">-></span><span class="n">insns</span><span class="p">);</span> <span class="c1">// <<-- HERE</span>
<span class="k">if</span> <span class="p">((</span><span class="n">cur_ret</span> <span class="o">&</span> <span class="n">SECCOMP_RET_ACTION</span><span class="p">)</span> <span class="o"><</span> <span class="p">(</span><span class="n">ret</span> <span class="o">&</span> <span class="n">SECCOMP_RET_ACTION</span><span class="p">))</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur_ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* ... */</span>
<span class="kt">int</span> <span class="nf">__secure_computing</span><span class="p">(</span><span class="kt">int</span> <span class="n">this_syscall</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">mode</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">mode</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">exit_sig</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">int</span> <span class="o">*</span><span class="n">syscall</span><span class="p">;</span>
<span class="n">u32</span> <span class="n">ret</span><span class="p">;</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">mode</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">SECCOMP_MODE_STRICT</span><span class="p">:</span>
<span class="n">syscall</span> <span class="o">=</span> <span class="n">mode1_syscalls</span><span class="p">;</span>
<span class="cp">#ifdef CONFIG_COMPAT
</span> <span class="k">if</span> <span class="p">(</span><span class="n">is_compat_task</span><span class="p">())</span>
<span class="n">syscall</span> <span class="o">=</span> <span class="n">mode1_syscalls_32</span><span class="p">;</span>
<span class="cp">#endif
</span> <span class="k">do</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">syscall</span> <span class="o">==</span> <span class="n">this_syscall</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">*++</span><span class="n">syscall</span><span class="p">);</span>
<span class="n">exit_sig</span> <span class="o">=</span> <span class="n">SIGKILL</span><span class="p">;</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">SECCOMP_RET_KILL</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="cp">#ifdef CONFIG_SECCOMP_FILTER
</span> <span class="k">case</span> <span class="n">SECCOMP_MODE_FILTER</span><span class="p">:</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">data</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">pt_regs</span> <span class="o">*</span><span class="n">regs</span> <span class="o">=</span> <span class="n">task_pt_regs</span><span class="p">(</span><span class="n">current</span><span class="p">);</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">seccomp_run_filters</span><span class="p">(</span><span class="n">this_syscall</span><span class="p">);</span> <span class="c1">// <<-- HERE</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">ret</span> <span class="o">&</span> <span class="n">SECCOMP_RET_DATA</span><span class="p">;</span>
<span class="n">ret</span> <span class="o">&=</span> <span class="n">SECCOMP_RET_ACTION</span><span class="p">;</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">SECCOMP_RET_ERRNO</span><span class="p">:</span>
<span class="cm">/* Set the low-order 16-bits as a errno. */</span>
<span class="n">syscall_set_return_value</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="n">regs</span><span class="p">,</span>
<span class="o">-</span><span class="n">data</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">goto</span> <span class="n">skip</span><span class="p">;</span>
<span class="cm">/* ...... */</span>
<span class="k">case</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">case</span> <span class="n">SECCOMP_RET_KILL</span><span class="p">:</span>
<span class="nl">default:</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">exit_sig</span> <span class="o">=</span> <span class="n">SIGSYS</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif
</span> <span class="nl">default:</span>
<span class="n">BUG</span><span class="p">();</span>
<span class="p">}</span>
<span class="cm">/* ...... */</span>
<span class="p">}</span>
<span class="cm">/* ... */</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">secure_computing</span><span class="p">(</span><span class="kt">int</span> <span class="n">this_syscall</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">test_thread_flag</span><span class="p">(</span><span class="n">TIF_SECCOMP</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">__secure_computing</span><span class="p">(</span><span class="n">this_syscall</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>On the x86_64 platform, <strong>secure_computing</strong> is called from only two files:</p>
<ul>
<li><em>arch/x86/kernel/ptrace.c</em>, let’s look deeper;</li>
<li><em>arch/x86/kernel/vsyscall_64.c</em>, not interesting: it serves only <strong>__NR_gettimeofday</strong>, <strong>__NR_time</strong>, <strong>__NR_getcpu</strong> in function <strong>emulate_vsyscall</strong>.</li>
</ul>
<p>Let’s investigate <em>arch/x86/kernel/ptrace.c</em>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* We must return the syscall number to actually look up in the table.
* This can be -1L to skip running any syscall at all.
*/</span>
<span class="kt">long</span> <span class="nf">syscall_trace_enter</span><span class="p">(</span><span class="k">struct</span> <span class="nc">pt_regs</span> <span class="o">*</span><span class="n">regs</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">long</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">user_exit</span><span class="p">();</span>
<span class="cm">/*
* If we stepped into a sysenter/syscall insn, it trapped in
* kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
* If user-mode had set TF itself, then it's still clear from
* do_debug() and we need to set it again to restore the user
* state. If we entered on the slow path, TF was already set.
*/</span>
<span class="k">if</span> <span class="p">(</span><span class="n">test_thread_flag</span><span class="p">(</span><span class="n">TIF_SINGLESTEP</span><span class="p">))</span>
<span class="n">regs</span><span class="o">-></span><span class="n">flags</span> <span class="o">|=</span> <span class="n">X86_EFLAGS_TF</span><span class="p">;</span>
<span class="cm">/* do the secure computing check first */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">secure_computing</span><span class="p">(</span><span class="n">regs</span><span class="o">-></span><span class="n">orig_ax</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// <<--- HERE</span>
<span class="cm">/* seccomp failures shouldn't expose any additional code. */</span>
<span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1L</span><span class="p">;</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* ...... */</span>
<span class="nl">out:</span>
<span class="k">return</span> <span class="n">ret</span> <span class="o">?:</span> <span class="n">regs</span><span class="o">-></span><span class="n">orig_ax</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* ... */</span>
<span class="cm">/*
* Register setup:
* rax system call number
* rdi arg0
* rcx return address for syscall/sysret, C arg3
* rsi arg1
* rdx arg2
* r10 arg3 (--> moved to rcx for C)
* r8 arg4
* r9 arg5
* r11 eflags for syscall/sysret, temporary for C
* r12-r15,rbp,rbx saved by C code, not touched.
*
* Interrupts are off on entry.
* Only called from user space.
*
* XXX if we had a free scratch register we could save the RSP into the stack frame
* and report it properly in ps. Unfortunately we haven't.
*
* When user can change the frames always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/</span>
<span class="n">ENTRY</span><span class="p">(</span><span class="n">system_call</span><span class="p">)</span>
<span class="cm">/* .... a lot of assembler code .... */</span>
<span class="cm">/* Do syscall tracing */</span>
<span class="n">tracesys</span><span class="o">:</span>
<span class="cp">#ifdef CONFIG_AUDITSYSCALL
</span> <span class="n">testl</span> <span class="err">$</span><span class="p">(</span><span class="n">_TIF_WORK_SYSCALL_ENTRY</span> <span class="o">&</span> <span class="o">~</span><span class="n">_TIF_SYSCALL_AUDIT</span><span class="p">),</span><span class="n">TI_flags</span><span class="o">+</span><span class="n">THREAD_INFO</span><span class="p">(</span><span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="n">RIP</span><span class="o">-</span><span class="n">ARGOFFSET</span><span class="p">)</span>
<span class="n">jz</span> <span class="n">auditsys</span>
<span class="cp">#endif
</span> <span class="n">SAVE_REST</span>
<span class="n">movq</span> <span class="err">$</span><span class="o">-</span><span class="n">ENOSYS</span><span class="p">,</span><span class="n">RAX</span><span class="p">(</span><span class="o">%</span><span class="n">rsp</span><span class="p">)</span> <span class="cm">/* ptrace can change this for a bad syscall */</span>
<span class="n">FIXUP_TOP_OF_STACK</span> <span class="o">%</span><span class="n">rdi</span>
<span class="n">movq</span> <span class="o">%</span><span class="n">rsp</span><span class="p">,</span><span class="o">%</span><span class="n">rdi</span>
<span class="n">call</span> <span class="n">syscall_trace_enter</span> <span class="c1">// <<--- HERE</span>
</code></pre></div></div>
<p>As we can see, the call chain is the following:</p>
<ul>
<li><strong>system_call</strong></li>
<li><strong>syscall_trace_enter</strong></li>
<li><strong>secure_computing</strong></li>
<li><strong>__secure_computing</strong></li>
<li><strong>seccomp_run_filters</strong></li>
<li><strong>sk_run_filter</strong></li>
</ul>
<p>This is definitely what we are looking for!</p>
<p><strong>Pay attention!</strong> The call goes directly to the <strong>sk_run_filter</strong> <strong><em>interpreter</em></strong>, not to a JIT-compiled <strong>bpf_func</strong>, so it hurts performance badly.</p>
<p>Now let’s compare the implementation of <strong>__secure_computing</strong> with the <strong>Linux kernel 4.19</strong>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">__secure_computing</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="o">*</span><span class="n">sd</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">mode</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">mode</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">this_syscall</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">IS_ENABLED</span><span class="p">(</span><span class="n">CONFIG_CHECKPOINT_RESTORE</span><span class="p">)</span> <span class="o">&&</span>
<span class="n">unlikely</span><span class="p">(</span><span class="n">current</span><span class="o">-></span><span class="n">ptrace</span> <span class="o">&</span> <span class="n">PT_SUSPEND_SECCOMP</span><span class="p">))</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">this_syscall</span> <span class="o">=</span> <span class="n">sd</span> <span class="o">?</span> <span class="n">sd</span><span class="o">-></span><span class="n">nr</span> <span class="o">:</span>
<span class="n">syscall_get_nr</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="n">task_pt_regs</span><span class="p">(</span><span class="n">current</span><span class="p">));</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">mode</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">SECCOMP_MODE_STRICT</span><span class="p">:</span>
<span class="n">__secure_computing_strict</span><span class="p">(</span><span class="n">this_syscall</span><span class="p">);</span> <span class="cm">/* may call do_exit */</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">case</span> <span class="n">SECCOMP_MODE_FILTER</span><span class="p">:</span>
<span class="k">return</span> <span class="n">__seccomp_filter</span><span class="p">(</span><span class="n">this_syscall</span><span class="p">,</span> <span class="n">sd</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span> <span class="c1">// <<-- HERE</span>
<span class="cm">/* Surviving SECCOMP_RET_KILL_* must be proactively impossible. */</span>
<span class="k">case</span> <span class="n">SECCOMP_MODE_DEAD</span><span class="p">:</span>
<span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="n">do_exit</span><span class="p">(</span><span class="n">SIGKILL</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="nl">default:</span>
<span class="n">BUG</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="cp">#ifdef CONFIG_SECCOMP_FILTER
</span><span class="k">static</span> <span class="kt">int</span> <span class="nf">__seccomp_filter</span><span class="p">(</span><span class="kt">int</span> <span class="n">this_syscall</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="o">*</span><span class="n">sd</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">bool</span> <span class="n">recheck_after_trace</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">u32</span> <span class="n">filter_ret</span><span class="p">,</span> <span class="n">action</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">match</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">data</span><span class="p">;</span>
<span class="cm">/*
* Make sure that any changes to mode from another thread have
* been seen after TIF_SECCOMP was seen.
*/</span>
<span class="n">rmb</span><span class="p">();</span>
<span class="n">filter_ret</span> <span class="o">=</span> <span class="n">seccomp_run_filters</span><span class="p">(</span><span class="n">sd</span><span class="p">,</span> <span class="o">&</span><span class="n">match</span><span class="p">);</span> <span class="c1">// <<-- HERE</span>
<span class="cm">/* ...... */</span>
<span class="nl">skip:</span>
<span class="n">seccomp_log</span><span class="p">(</span><span class="n">this_syscall</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">match</span> <span class="o">?</span> <span class="n">match</span><span class="o">-></span><span class="n">log</span> <span class="o">:</span> <span class="nb">false</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/**
* seccomp_run_filters - evaluates all seccomp filters against @sd
* @sd: optional seccomp data to be passed to filters
* @match: stores struct seccomp_filter that resulted in the return value,
* unless filter returned SECCOMP_RET_ALLOW, in which case it will
* be unchanged.
*
* Returns valid seccomp BPF response codes.
*/</span>
<span class="cp">#define ACTION_ONLY(ret) ((s32)((ret) & (SECCOMP_RET_ACTION_FULL)))
</span><span class="k">static</span> <span class="n">u32</span> <span class="nf">seccomp_run_filters</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="o">*</span><span class="n">sd</span><span class="p">,</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">**</span><span class="n">match</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="n">sd_local</span><span class="p">;</span>
<span class="n">u32</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">;</span>
<span class="cm">/* Make sure cross-thread synced filter points somewhere sane. */</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span>
<span class="n">READ_ONCE</span><span class="p">(</span><span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">);</span>
<span class="cm">/* Ensure unexpected behavior doesn't result in failing open. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">WARN_ON</span><span class="p">(</span><span class="n">f</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">SECCOMP_RET_KILL_PROCESS</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">sd</span><span class="p">)</span> <span class="p">{</span>
<span class="n">populate_seccomp_data</span><span class="p">(</span><span class="o">&</span><span class="n">sd_local</span><span class="p">);</span>
<span class="n">sd</span> <span class="o">=</span> <span class="o">&</span><span class="n">sd_local</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">f</span><span class="p">;</span> <span class="n">f</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">u32</span> <span class="n">cur_ret</span> <span class="o">=</span> <span class="n">BPF_PROG_RUN</span><span class="p">(</span><span class="n">f</span><span class="o">-></span><span class="n">prog</span><span class="p">,</span> <span class="n">sd</span><span class="p">);</span> <span class="c1">// <<-- HERE</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ACTION_ONLY</span><span class="p">(</span><span class="n">cur_ret</span><span class="p">)</span> <span class="o"><</span> <span class="n">ACTION_ONLY</span><span class="p">(</span><span class="n">ret</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur_ret</span><span class="p">;</span>
<span class="o">*</span><span class="n">match</span> <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#endif </span><span class="cm">/* CONFIG_SECCOMP_FILTER */</span><span class="cp">
</span>
<span class="cm">/* ...... */</span>
<span class="k">struct</span> <span class="nc">sk_filter</span> <span class="p">{</span>
<span class="n">refcount_t</span> <span class="n">refcnt</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">rcu_head</span> <span class="n">rcu</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">bpf_prog</span> <span class="o">*</span><span class="n">prog</span><span class="p">;</span>
<span class="p">};</span>
<span class="cp">#define BPF_PROG_RUN(filter, ctx) (*(filter)->bpf_func)(ctx, (filter)->insnsi)
</span></code></pre></div></div>
<p><strong>Pay attention!</strong> In <strong>Linux kernel 4.19</strong>, <strong>seccomp</strong> is JIT-powered: the compiled <strong>bpf_func</strong> is invoked via <strong>BPF_PROG_RUN</strong>.</p>
<h1 id="seccomp-in-docker">Seccomp in Docker</h1>
<p>The docker-engine provides the <strong>seccomp</strong> feature; see the <a href="https://docs.docker.com/engine/security/seccomp/">official Docker documentation</a> for details. Moreover, an <strong>Openstack</strong> host machine usually carries a JSON file like the one below, which overrides Docker’s <strong>seccomp</strong> defaults (this matters little in practice, because both the Docker and the Openstack defaults contain a whitelist of 300+ system calls allowed for the container):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"defaultAction"</span><span class="p">:</span><span class="w"> </span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"architectures"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"SCMP_ARCH_X86_64"</span><span class="p">,</span><span class="w">
</span><span class="s2">"SCMP_ARCH_X86"</span><span class="p">,</span><span class="w">
</span><span class="s2">"SCMP_ARCH_X32"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"syscalls"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"io_submit"</span><span class="p">,</span><span class="w">
</span><span class="nl">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"SCMP_ACT_ALLOW"</span><span class="p">,</span><span class="w">
</span><span class="nl">"priority"</span><span class="p">:</span><span class="w"> </span><span class="mi">254</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="err">#</span><span class="w"> </span><span class="err">for</span><span class="w"> </span><span class="err">each</span><span class="w"> </span><span class="err">allowed</span><span class="w"> </span><span class="err">system</span><span class="w"> </span><span class="err">call</span><span class="w"> </span><span class="err">there's</span><span class="w"> </span><span class="err">an</span><span class="w"> </span><span class="err">entry</span><span class="w"> </span><span class="err">here</span><span class="w"> </span><span class="err">=></span><span class="w"> </span><span class="mi">300</span><span class="err">+</span><span class="w"> </span><span class="err">entries</span><span class="w"> </span><span class="err">#</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Note: for our research we used the default JSON configuration files (from Docker and Openstack); your production configuration might differ.</p>
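<p>For contrast, a blacklist-style profile (the mitigation from the recommendations above) inverts the default action: allow everything by default and deny a short list of dangerous calls. The sketch below is illustrative only; the syscall names are example entries, not our actual ~40-entry production list:</p>

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    { "name": "mount",       "action": "SCMP_ACT_ERRNO" },
    { "name": "umount2",     "action": "SCMP_ACT_ERRNO" },
    { "name": "reboot",      "action": "SCMP_ACT_ERRNO" },
    { "name": "init_module", "action": "SCMP_ACT_ERRNO" },
    { "name": "kexec_load",  "action": "SCMP_ACT_ERRNO" }
  ]
}
```

<p>The resulting BPF program has a handful of comparisons instead of 300+, which is why it runs much faster under the 3.10 interpreter.</p>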
<p>Docker may load the BPF program into the Linux kernel using either the <strong>prctl</strong> (older kernels) or the <strong>seccomp</strong> (recent kernels) system call; see Docker’s runc component (<a href="https://github.com/opencontainers/runc/blob/main/libcontainer/seccomp/patchbpf/enosys_linux.go">libcontainer/seccomp/patchbpf/enosys_linux.go</a>):</p>
<p><img src="/assets/images/skrunfilter/runc-seccompsetfilter.png" alt="runc-seccompsetfilter" /></p>
<p>The BPF program is constructed on the fly from the JSON file, as the following code snippets show:</p>
<p><img src="/assets/images/skrunfilter/runc-patchandload.png" alt="runc-patchandload" /></p>
<p><img src="/assets/images/skrunfilter/runc-patchfilter.png" alt="runc-patchfilter" /></p>
<p><img src="/assets/images/skrunfilter/runc-generatepatch.png" alt="runc-generatepatch" /></p>
<p><img src="/assets/images/skrunfilter/runc-generatestub.png" alt="runc-generatestub" /></p>
<p>The BPF instructions are packed into a C struct that mirrors the kernel structure <strong>sock_filter</strong>:</p>
<p><img src="/assets/images/skrunfilter/runc-assemble.png" alt="runc-assemble" /></p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
* Try and keep these values and structures similar to BSD, especially
* the BPF code definitions which need to match so you can share filters
*/</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="p">{</span> <span class="cm">/* Filter block */</span>
<span class="n">__u16</span> <span class="n">code</span><span class="p">;</span> <span class="cm">/* Actual filter code */</span>
<span class="n">__u8</span> <span class="n">jt</span><span class="p">;</span> <span class="cm">/* Jump true */</span>
<span class="n">__u8</span> <span class="n">jf</span><span class="p">;</span> <span class="cm">/* Jump false */</span>
<span class="n">__u32</span> <span class="n">k</span><span class="p">;</span> <span class="cm">/* Generic multiuse field */</span>
<span class="p">};</span>
<span class="k">struct</span> <span class="nc">sock_fprog</span> <span class="p">{</span> <span class="cm">/* Required for SO_ATTACH_FILTER. */</span>
<span class="kt">unsigned</span> <span class="kt">short</span> <span class="n">len</span><span class="p">;</span> <span class="cm">/* Number of filter blocks */</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="n">__user</span> <span class="o">*</span><span class="n">filter</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<h1 id="reproducing-the-problem">Reproducing the problem</h1>
<p>As a small reproducer, <strong>pthread_cond_timedwait</strong> was chosen, since our MySQL server invokes it quite often (it is a library call that blocks in the <strong>futex</strong> system call under the hood). The testing code creates a number of threads that call <strong>pthread_cond_timedwait</strong> very frequently:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
</span>
<span class="n">pthread_mutex_t</span> <span class="n">MUX</span><span class="p">;</span>
<span class="n">pthread_cond_t</span> <span class="n">COND</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">finish</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="cp">#define TH_NUM 256
#define NSEC 1000
</span><span class="n">pthread_t</span> <span class="n">TH</span><span class="p">[</span><span class="n">TH_NUM</span><span class="p">];</span>
<span class="kt">void</span><span class="o">*</span> <span class="nf">thread_func</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">arg</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="nc">timespec</span> <span class="n">ts</span><span class="p">;</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">MUX</span><span class="p">);</span>
<span class="k">while</span><span class="p">(</span><span class="o">!</span><span class="n">finish</span><span class="p">)</span> <span class="p">{</span>
<span class="n">clock_gettime</span><span class="p">(</span><span class="n">CLOCK_REALTIME</span><span class="p">,</span> <span class="o">&</span><span class="n">ts</span><span class="p">);</span>
<span class="n">ts</span><span class="p">.</span><span class="n">tv_nsec</span> <span class="o">+=</span> <span class="n">NSEC</span><span class="p">;</span>
<span class="n">pthread_cond_timedwait</span><span class="p">(</span><span class="o">&</span><span class="n">COND</span><span class="p">,</span> <span class="o">&</span><span class="n">MUX</span><span class="p">,</span> <span class="o">&</span><span class="n">ts</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">MUX</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Finished</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">pthread_mutex_init</span><span class="p">(</span><span class="o">&</span><span class="n">MUX</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pthread_cond_init</span><span class="p">(</span><span class="o">&</span><span class="n">COND</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">TH_NUM</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
<span class="n">pthread_create</span><span class="p">(</span><span class="o">&</span><span class="n">TH</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">thread_func</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">pause</span><span class="p">();</span>
<span class="n">pthread_mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">MUX</span><span class="p">);</span>
<span class="n">finish</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">pthread_mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">MUX</span><span class="p">);</span>
<span class="n">pthread_cond_broadcast</span><span class="p">(</span><span class="o">&</span><span class="n">COND</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Our steps:</p>
<ul>
<li>copy our docker’s default JSON file (<code class="language-plaintext highlighter-rouge">seccomp.json</code>) to a local server,</li>
<li>run the docker engine using <code class="language-plaintext highlighter-rouge">--security-opt seccomp:seccomp.json</code>,</li>
<li>compile the small reproducer (<code class="language-plaintext highlighter-rouge">inf.c</code>, shown above),</li>
<li>gather the perf profile,</li>
<li>all on a host running Linux kernel 3.10.</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span>gcc <span class="nt">--version</span>
gcc <span class="o">(</span>GCC<span class="o">)</span> 10.3.0
<span class="o">[</span>host]<span class="nv">$ </span>gcc inf.c <span class="nt">-o</span> inf <span class="nt">-pthread</span>
<span class="o">[</span>host]<span class="nv">$ </span>docker run <span class="nt">--security-opt</span> seccomp:<span class="nv">$HOME</span>/sk_run_filter/seccomp.json <span class="nt">-v</span><span class="nv">$HOME</span>:<span class="nv">$HOME</span> <span class="nt">-it</span> <span class="nv">$euler_os_container</span> /bin/bash
<span class="o">[</span>container]<span class="nv">$ </span><span class="nb">cd</span> <span class="nv">$HOME</span>/sk_run_filter/
<span class="o">[</span>container]<span class="nv">$ </span>./inf
<span class="o">[</span>host]<span class="nv">$ </span>ps aux | <span class="nb">grep</span> <span class="s1">'./inf'</span>
root 54628 401 0.0 2104832 2496 pts/0 Sl+ 21:40 2:20 ./inf
root 59153 0.0 0.0 112672 952 pts/52 S+ 21:40 0:00 <span class="nb">grep</span> <span class="nt">--color</span><span class="o">=</span>auto ./inf
<span class="o">[</span>host]<span class="nv">$ </span>perf record <span class="nt">-p</span> 54628 <span class="nt">--</span> <span class="nb">sleep </span>10
<span class="o">[</span> perf record: Woken up 33 <span class="nb">times </span>to write data <span class="o">]</span>
<span class="o">[</span> perf record: Captured and wrote 12.998 MB perf.data <span class="o">(</span>335688 samples<span class="o">)</span> <span class="o">]</span>
<span class="o">[</span>host]<span class="nv">$ </span>perf report
</code></pre></div></div>
<p><img src="/assets/images/skrunfilter/repro-perf-report.png" alt="repro-perf-report" /></p>
<p><img src="/assets/images/skrunfilter/repro-top.png" alt="repro-top" /></p>
<p>Aha! <strong>sk_run_filter</strong> shows its true face! Let’s repeat the same steps for Linux kernel 4.19.</p>
<p><img src="/assets/images/skrunfilter/repro-perf-report-418.png" alt="repro-perf-report-418" /></p>
<p><img src="/assets/images/skrunfilter/repro-top-418.png" alt="repro-top-418" /></p>
<p>As we can see, the bottleneck in Linux 4.19 has shifted to <strong>do_syscall_64</strong>; <strong>sk_run_filter</strong> and <strong>__secure_computing</strong> no longer show up in the profile.</p>
<h1 id="mitigation-for-the-linux-kernel-310">Mitigation for the Linux kernel 3.10</h1>
<p>If you look into the Linux kernel 3.10 <a href="https://elixir.bootlin.com/linux/v3.10.108/source/arch/x86/syscalls/syscall_64.tbl">arch/x86/syscalls/syscall_64.tbl</a>, you will see system calls numbered 0 through 313 for the x86_64 architecture.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#
# 64-bit system call numbers and entry vectors
#
# The format is:
# <number> <abi> <name> <entry point>
#
# The abi is "common", "64" or "x32" for this file.
#
0 common read sys_read
1 common write sys_write
2 common open sys_open
3 common close sys_close
......
310 64 process_vm_readv sys_process_vm_readv
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 common finit_module sys_finit_module
</code></pre></div></div>
<p>Comparing our local JSON file against the list of all available system calls, we can conclude that roughly 300 system calls are allowed and about 40 are filtered out (the default JSON file may target a wide range of Linux kernels, say from 3.10 to 5.15, which is why 300+40 != 313). Blocking 40 calls should be cheaper than allowing 300+, so let’s try a blacklist.</p>
<p><code class="language-plaintext highlighter-rouge">my.json</code> file:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"defaultAction"</span><span class="p">:</span><span class="w"> </span><span class="s2">"SCMP_ACT_ALLOW"</span><span class="p">,</span><span class="w">
</span><span class="nl">"architectures"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"SCMP_ARCH_X86_64"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"syscalls"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"acct"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"add_key"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"adjtimex"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"afs_syscall"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"clock_adjtime"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"clock_settime"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"create_module"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"delete_module"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"finit_module"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"get_kernel_syms"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"get_mempolicy"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"getpmsg"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"init_module"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"ioperm"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"iopl"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"kcmp"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"kexec_load"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"keyctl"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"mbind"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"migrate_pages"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"mlock"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"mlockall"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"move_pages"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"nfsservctl"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"open_by_handle_at"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"pivot_root"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"process_vm_readv"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"process_vm_writev"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"ptrace"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"putpmsg"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"query_module"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"quotactl"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"request_key"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"security"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"set_mempolicy"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"settimeofday"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"swapoff"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"swapon"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"_sysctl"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"sysfs"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"tuxcall"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"uselib"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"ustat"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"vhangup"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="s2">"vserver"</span><span class="p">,</span><span class="w"> </span><span class="nl">"action"</span><span class="p">:</span><span class="s2">"SCMP_ACT_ERRNO"</span><span class="p">,</span><span class="w"> </span><span class="nl">"priority"</span><span class="p">:</span><span class="mi">1</span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Evaluating it with our small reproducer:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span>docker run <span class="nt">--security-opt</span> seccomp:<span class="nv">$HOME</span>/sk_run_filter/my.json <span class="nt">-v</span><span class="nv">$HOME</span>:<span class="nv">$HOME</span> <span class="nt">-it</span> <span class="nv">$euler_os_container</span> /bin/bash
<span class="o">[</span>container]<span class="nv">$ </span>./inf
<span class="o">[</span>host]<span class="nv">$ </span>ps aux | <span class="nb">grep</span> <span class="s1">'./inf'</span>
root 45259 325 0.0 2104832 2496 pts/0 Sl+ 22:02 0:26 ./inf
root 46403 0.0 0.0 112668 952 pts/52 S+ 22:02 0:00 <span class="nb">grep</span> <span class="nt">--color</span><span class="o">=</span>auto ./inf
<span class="o">[</span>host]<span class="nv">$ </span>perf record <span class="nt">-p</span> 45259 <span class="nt">--</span> <span class="nb">sleep </span>10
<span class="o">[</span> perf record: Woken up 23 <span class="nb">times </span>to write data <span class="o">]</span>
<span class="o">[</span> perf record: Captured and wrote 11.160 MB perf.data <span class="o">(</span>287512 samples<span class="o">)</span> <span class="o">]</span>
<span class="o">[</span>host]<span class="nv">$ </span>perf report
</code></pre></div></div>
<p><img src="/assets/images/skrunfilter/repro-perf-report-mitigation-blacklist.png" alt="repro-perf-report-mitigation-blacklist" /></p>
<p><img src="/assets/images/skrunfilter/repro-top-mitigation-blacklist.png" alt="repro-top-mitigation-blacklist" /></p>
<p>CPU usage drops from <strong>50%</strong> to <strong>20%</strong>, not bad for a first try. Meanwhile, the question that doesn’t give me peace of mind: what is inside the generated BPF program, and why does it take so long to execute?</p>
<h1 id="evaluation-of-bpf-program-generated-by-docker">Evaluation of BPF program generated by docker</h1>
<p>The BPF program is stored in <code class="language-plaintext highlighter-rouge">current-&gt;seccomp.filter</code>, a field of the current task’s <code class="language-plaintext highlighter-rouge">task_struct</code>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">task_struct</span> <span class="p">{</span>
<span class="k">volatile</span> <span class="kt">long</span> <span class="n">state</span><span class="p">;</span> <span class="cm">/* -1 unrunnable, 0 runnable, >0 stopped */</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">stack</span><span class="p">;</span>
<span class="n">atomic_t</span> <span class="n">usage</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">;</span> <span class="cm">/* per process flags, defined below */</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">ptrace</span><span class="p">;</span>
<span class="cp">#ifdef CONFIG_SMP
</span> <span class="k">struct</span> <span class="nc">llist_node</span> <span class="n">wake_entry</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">on_cpu</span><span class="p">;</span>
<span class="cp">#endif
</span>
<span class="cm">/* .... a lot of fields .... */</span>
<span class="k">struct</span> <span class="nc">seccomp</span> <span class="n">seccomp</span><span class="p">;</span> <span class="c1">//<<-- HERE</span>
<span class="cm">/* .... */</span>
<span class="p">};</span>
<span class="cm">/* ... */</span>
<span class="cm">/**
* struct seccomp - the state of a seccomp'ed process
*
* @mode: indicates one of the valid values above for controlled
* system calls available to a process.
* @filter: The metadata and ruleset for determining what system calls
* are allowed for a task.
*
* @filter must only be accessed from the context of current as there
* is no locking.
*/</span>
<span class="k">struct</span> <span class="nc">seccomp</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">mode</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">filter</span><span class="p">;</span>
<span class="p">};</span>
<span class="cm">/* ... */</span>
<span class="cm">/**
* struct seccomp_filter - container for seccomp BPF programs
*
* @usage: reference count to manage the object lifetime.
* get/put helpers should be used when accessing an instance
* outside of a lifetime-guarded section. In general, this
* is only needed for handling filters shared across tasks.
* @prev: points to a previously installed, or inherited, filter
* @len: the number of instructions in the program
* @insns: the BPF program instructions to evaluate
*
* seccomp_filter objects are organized in a tree linked via the @prev
* pointer. For any task, it appears to be a singly-linked list starting
* with current->seccomp.filter, the most recently attached or inherited filter.
* However, multiple filters may share a @prev node, by way of fork(), which
* results in a unidirectional tree existing in memory. This is similar to
* how namespaces work.
*
* seccomp_filter objects should never be modified after being attached
* to a task_struct (other than @usage).
*/</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="p">{</span>
<span class="n">atomic_t</span> <span class="n">usage</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">prev</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">short</span> <span class="n">len</span><span class="p">;</span> <span class="cm">/* Instruction count */</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="n">insns</span><span class="p">[];</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The BPF program itself is stored in <code class="language-plaintext highlighter-rouge">seccomp_filter::insns</code> and the instruction count in <code class="language-plaintext highlighter-rouge">seccomp_filter::len</code>. Let’s evaluate the size of the BPF program generated from our OpenStack default docker <strong>seccomp</strong> JSON file. To do that, we need access to the current task’s <code class="language-plaintext highlighter-rouge">task_struct</code>, so we have to switch to kernel space. Let’s create a very simple kernel module that serves a character device. <a href="https://blog.sourcerer.io/writing-a-simple-linux-kernel-module-d9dc3762c234">This blog</a> was used as a working example.</p>
<p>Module source code:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <asm/uaccess.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <uapi/linux/filter.h>
</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="p">{</span>
<span class="n">atomic_t</span> <span class="n">usage</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">prev</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">short</span> <span class="n">len</span><span class="p">;</span> <span class="cm">/* Instruction count */</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="n">insns</span><span class="p">[];</span>
<span class="p">};</span>
<span class="n">MODULE_LICENSE</span><span class="p">(</span><span class="s">"GPL"</span><span class="p">);</span>
<span class="n">MODULE_AUTHOR</span><span class="p">(</span><span class="s">"Philimonov Dmitriy"</span><span class="p">);</span>
<span class="n">MODULE_DESCRIPTION</span><span class="p">(</span><span class="s">"Getting the seccomp instructions count"</span><span class="p">);</span>
<span class="n">MODULE_VERSION</span><span class="p">(</span><span class="s">"0.01"</span><span class="p">);</span>
<span class="cp">#define DEVICE_NAME "seccomp_icount"
</span>
<span class="cm">/* Prototypes for device functions */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_open</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_release</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_read</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_write</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">major_num</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">device_open_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">get_seccomp_icount</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">f</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">commands_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">f</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">;</span> <span class="n">f</span><span class="p">;</span> <span class="n">f</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">commands_count</span> <span class="o">+=</span> <span class="n">f</span><span class="o">-></span><span class="n">len</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"Servicing the process pid=%d, seccomp_mode=%d, seccomp_filter=%p, instructions=%lu</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">current</span><span class="o">-></span><span class="n">pid</span><span class="p">,</span>
<span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">mode</span><span class="p">,</span>
<span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">,</span>
<span class="n">commands_count</span><span class="p">);</span>
<span class="k">return</span> <span class="n">commands_count</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* This structure points to all of the device functions */</span>
<span class="k">static</span> <span class="k">struct</span> <span class="nc">file_operations</span> <span class="n">file_ops</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">read</span> <span class="o">=</span> <span class="n">device_read</span><span class="p">,</span>
<span class="p">.</span><span class="n">write</span> <span class="o">=</span> <span class="n">device_write</span><span class="p">,</span>
<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">device_open</span><span class="p">,</span>
<span class="p">.</span><span class="n">release</span> <span class="o">=</span> <span class="n">device_release</span>
<span class="p">};</span>
<span class="cm">/* When a process reads from our device, this gets called. */</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_read</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">flip</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">offset</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">char</span> <span class="n">kbuf</span><span class="p">[</span><span class="mi">32</span><span class="p">];</span>
<span class="kt">size_t</span> <span class="n">bytes_written</span><span class="p">,</span> <span class="n">copied</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">offset</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">bytes_written</span> <span class="o">=</span> <span class="n">scnprintf</span><span class="p">(</span><span class="n">kbuf</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="s">"%lu</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">get_seccomp_icount</span><span class="p">());</span>
<span class="n">copied</span> <span class="o">=</span> <span class="n">bytes_written</span> <span class="o"><=</span> <span class="n">len</span> <span class="o">?</span> <span class="n">bytes_written</span> <span class="o">:</span> <span class="n">len</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">copy_to_user</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="n">kbuf</span><span class="p">,</span> <span class="n">copied</span><span class="p">))</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="o">*</span><span class="n">offset</span> <span class="o">+=</span> <span class="n">copied</span><span class="p">;</span>
<span class="k">return</span> <span class="n">copied</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Called when a process tries to write to our device */</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_write</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">flip</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">offset</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* This is a read-only device */</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_ALERT</span> <span class="s">"This operation is not supported.</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Called when a process opens our device */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_open</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* If device is open, return busy */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">device_open_count</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EBUSY</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">device_open_count</span><span class="o">++</span><span class="p">;</span>
<span class="n">try_module_get</span><span class="p">(</span><span class="n">THIS_MODULE</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Called when a process closes our device */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_release</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* Decrement the open counter and usage count. Without this, the module would not unload. */</span>
<span class="n">device_open_count</span><span class="o">--</span><span class="p">;</span>
<span class="n">module_put</span><span class="p">(</span><span class="n">THIS_MODULE</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">__init</span> <span class="nf">seccomp_icount_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* Try to register character device */</span>
<span class="n">major_num</span> <span class="o">=</span> <span class="n">register_chrdev</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">DEVICE_NAME</span><span class="p">,</span> <span class="o">&</span><span class="n">file_ops</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">major_num</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_ALERT</span> <span class="s">"Could not register device: %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">major_num</span><span class="p">);</span>
<span class="k">return</span> <span class="n">major_num</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"seccomp_icount module loaded with device major number %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">major_num</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">__exit</span> <span class="nf">seccomp_icount_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* Remember — we have to clean up after ourselves. Unregister the character device. */</span>
<span class="n">unregister_chrdev</span><span class="p">(</span><span class="n">major_num</span><span class="p">,</span> <span class="n">DEVICE_NAME</span><span class="p">);</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"Unregistering seccomp_icount</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/* Register module functions */</span>
<span class="n">module_init</span><span class="p">(</span><span class="n">seccomp_icount_init</span><span class="p">);</span>
<span class="n">module_exit</span><span class="p">(</span><span class="n">seccomp_icount_exit</span><span class="p">);</span>
</code></pre></div></div>
<p>A Makefile for it:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-m +<span class="o">=</span> seccomp_icount.o
all:
make <span class="nt">-C</span> /lib/modules/<span class="si">$(</span>shell <span class="nb">uname</span> <span class="nt">-r</span><span class="si">)</span>/build <span class="nv">M</span><span class="o">=</span><span class="si">$(</span>PWD<span class="si">)</span> modules
clean:
make <span class="nt">-C</span> /lib/modules/<span class="si">$(</span>shell <span class="nb">uname</span> <span class="nt">-r</span><span class="si">)</span>/build <span class="nv">M</span><span class="o">=</span><span class="si">$(</span>PWD<span class="si">)</span> clean
</code></pre></div></div>
<p>Next steps:</p>
<ul>
<li>Compile the module for the Linux kernel 3.10</li>
<li>Insert the module into the running kernel</li>
<li>Create a character device</li>
<li>Run a docker container with the custom <strong>seccomp</strong> profile enabled, propagating the custom device into the container</li>
<li>Check the number of BPF instructions executed on each system call</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span>gcc <span class="nt">--version</span>
gcc <span class="o">(</span>GCC<span class="o">)</span> 4.8.5 20150623 <span class="o">(</span>EulerOS 4.8.5-4<span class="o">)</span>
<span class="o">[</span>host]<span class="nv">$ </span><span class="nb">cd </span>k310.module <span class="o">&&</span> make
<span class="o">[</span>host]<span class="nv">$ </span>dmesg <span class="nt">-C</span> <span class="c"># clear debug ring buffer</span>
<span class="o">[</span>host]<span class="nv">$ </span>insmod seccomp_icount.ko
<span class="o">[</span>host]<span class="nv">$ </span>dmesg <span class="nt">-T</span>
seccomp_icount module loaded with device major number 241
<span class="o">[</span>host]<span class="nv">$ MAJOR</span><span class="o">=</span>241<span class="p">;</span> <span class="nb">sudo mknod</span> /dev/seccomp_icount c <span class="nv">$MAJOR</span> 0
<span class="o">[</span>host]<span class="nv">$ </span>docker run <span class="nt">--security-opt</span> seccomp:<span class="nv">$HOME</span>/sk_run_filter/seccomp.json <span class="nt">-v</span><span class="nv">$HOME</span>:<span class="nv">$HOME</span> <span class="nt">--device</span><span class="o">=</span>/dev/seccomp_icount:/dev/seccomp_icount <span class="nt">-it</span> <span class="nv">$euler_os_container</span> /bin/bash
<span class="o">[</span>container]<span class="nv">$ </span><span class="nb">cat</span> /dev/seccomp_icount
953
<span class="o">[</span>host]<span class="nv">$ </span>dmesg | <span class="nb">tail</span> <span class="nt">-n1</span>
Servicing the process <span class="nv">pid</span><span class="o">=</span>19032, <span class="nv">seccomp_mode</span><span class="o">=</span>2, <span class="nv">seccomp_filter</span><span class="o">=</span>ffff88013721c000, <span class="nv">instructions</span><span class="o">=</span>953
</code></pre></div></div>
<p>So, <strong>docker-engine::runc</strong> creates a program of <strong>953</strong> instructions in our case, which is interpreted on every system call made by <strong><em>all programs</em></strong> inside the docker container, including the MySQL server.</p>
<p>Using the blacklist mitigation (a smaller <strong>seccomp</strong> JSON file):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span>docker run <span class="nt">--security-opt</span> seccomp:<span class="nv">$HOME</span>/sk_run_filter/my.json <span class="nt">-v</span><span class="nv">$HOME</span>:<span class="nv">$HOME</span> <span class="nt">--device</span><span class="o">=</span>/dev/seccomp_icount:/dev/seccomp_icount <span class="nt">-it</span> <span class="nv">$euler_os_container</span> /bin/bash
<span class="o">[</span>container]<span class="nv">$ </span><span class="nb">cat</span> /dev/seccomp_icount
52
<span class="o">[</span>host]<span class="nv">$ </span>dmesg <span class="nt">-T</span>
Servicing the process <span class="nv">pid</span><span class="o">=</span>42143, <span class="nv">seccomp_mode</span><span class="o">=</span>2, <span class="nv">seccomp_filter</span><span class="o">=</span>ffff8847e525ca00, <span class="nv">instructions</span><span class="o">=</span>52
</code></pre></div></div>
<p>So, the number of BPF instructions is reduced from <strong>953</strong> to <strong>52</strong>.</p>
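<p>The blacklist profile simply flips docker’s default: allow everything, deny a short list. An illustrative fragment in the docker seccomp profile format (a sketch only — the syscall names here are examples, not our full ~40-entry <code class="language-plaintext highlighter-rouge">my.json</code>):</p>

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mount", "umount2", "reboot", "init_module", "delete_module", "kexec_load"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

<p>Since the generated BPF emits roughly one comparison per listed syscall, keeping the list short keeps the per-syscall interpretation cost low on kernels without a seccomp JIT.</p>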
<h1 id="impact-on-our-mysql-server">Impact on our MySQL server</h1>
<p>According to my benchmarks, the performance drop for our MySQL server is huge: more than <strong>40%</strong>.
Some benchmark results for a <strong>1u4g</strong> cloud instance configured with a CFS quota; the data set is 40 tables with 10 million rows each:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Load type</th>
<th style="text-align: right">Threads</th>
<th style="text-align: right">Performance drop</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">OLTP_PS</td>
<td style="text-align: right">8</td>
<td style="text-align: right">-46.06%</td>
</tr>
<tr>
<td style="text-align: center">OLTP_RO</td>
<td style="text-align: right">8</td>
<td style="text-align: right">-41.76%</td>
</tr>
<tr>
<td style="text-align: center">OLTP_RW</td>
<td style="text-align: right">1</td>
<td style="text-align: right">-24.54%</td>
</tr>
<tr>
<td style="text-align: center">OLTP_UPDATE_INDEX</td>
<td style="text-align: right">1</td>
<td style="text-align: right">-15.98%</td>
</tr>
<tr>
<td style="text-align: center">OLTP_UPDATE_NON_INDEX</td>
<td style="text-align: right">64</td>
<td style="text-align: right">-38.48%</td>
</tr>
</tbody>
</table>
<hr />
<p>The CPU waste (per load type):</p>
<table>
<thead>
<tr>
<th style="text-align: center">OLTP</th>
<th style="text-align: right">CPU kernel 3.10</th>
<th style="text-align: right">CPU kernel 4.18</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">PS/64</td>
<td style="text-align: right">23.62%</td>
<td style="text-align: right">0.03%</td>
</tr>
<tr>
<td style="text-align: center">RO/64</td>
<td style="text-align: right">19.35%</td>
<td style="text-align: right">0.02%</td>
</tr>
<tr>
<td style="text-align: center">RW/64</td>
<td style="text-align: right">17.90%</td>
<td style="text-align: right">0.02%</td>
</tr>
<tr>
<td style="text-align: center">WO/64</td>
<td style="text-align: right">21.56%</td>
<td style="text-align: right">0.05%</td>
</tr>
<tr>
<td style="text-align: center">UPDATE_INDEX/64</td>
<td style="text-align: right">22.45%</td>
<td style="text-align: right">0.03%</td>
</tr>
<tr>
<td style="text-align: center">UPDATE_NON_INDEX/64</td>
<td style="text-align: right">29.93%</td>
<td style="text-align: right">0.04%</td>
</tr>
<tr>
<td style="text-align: center">INSERT/64</td>
<td style="text-align: right">16.80%</td>
<td style="text-align: right">0.05%</td>
</tr>
</tbody>
</table>
<hr />
<h1 id="why-the-execution-of-bpf-program-is-so-sub-optimal">Why is the execution of the BPF program so sub-optimal?</h1>
<p>Let’s dump the BPF code from kernel space. For that, we need to modify our simple kernel module a bit to dump <code class="language-plaintext highlighter-rouge">current->seccomp.filter.insns</code>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <linux/uaccess.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <uapi/linux/filter.h>
</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="p">{</span>
<span class="n">atomic_t</span> <span class="n">usage</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">prev</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">short</span> <span class="n">len</span><span class="p">;</span> <span class="cm">/* Instruction count */</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="n">insns</span><span class="p">[];</span>
<span class="p">};</span>
<span class="n">MODULE_LICENSE</span><span class="p">(</span><span class="s">"GPL"</span><span class="p">);</span>
<span class="n">MODULE_AUTHOR</span><span class="p">(</span><span class="s">"Philimonov Dmitriy"</span><span class="p">);</span>
<span class="n">MODULE_DESCRIPTION</span><span class="p">(</span><span class="s">"Dumping the seccomp instructions for kernel 3.10"</span><span class="p">);</span>
<span class="n">MODULE_VERSION</span><span class="p">(</span><span class="s">"0.01"</span><span class="p">);</span>
<span class="cp">#define DEVICE_NAME "seccomp_idump"
</span>
<span class="cm">/* Prototypes for device functions */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_open</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_release</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_read</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_write</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">major_num</span><span class="p">;</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">device_open_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">void</span> <span class="nf">print_seccomp_icount</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">f</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">commands_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">f</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">;</span> <span class="n">f</span><span class="p">;</span> <span class="n">f</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">commands_count</span> <span class="o">+=</span> <span class="n">f</span><span class="o">-></span><span class="n">len</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"Servicing the process pid=%d, seccomp_mode=%d, seccomp_filter=%p, instructions=%lu</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">current</span><span class="o">-></span><span class="n">pid</span><span class="p">,</span>
<span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">mode</span><span class="p">,</span>
<span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">,</span>
<span class="n">commands_count</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/* This structure points to all of the device functions */</span>
<span class="k">static</span> <span class="k">struct</span> <span class="nc">file_operations</span> <span class="n">file_ops</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">.</span><span class="n">read</span> <span class="o">=</span> <span class="n">device_read</span><span class="p">,</span>
<span class="p">.</span><span class="n">write</span> <span class="o">=</span> <span class="n">device_write</span><span class="p">,</span>
<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">device_open</span><span class="p">,</span>
<span class="p">.</span><span class="n">release</span> <span class="o">=</span> <span class="n">device_release</span>
<span class="p">};</span>
<span class="cm">/* When a process reads from our device, this gets called. */</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_read</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">flip</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">offset</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">size_t</span> <span class="n">fp_size</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">copied</span><span class="p">,</span> <span class="n">curr_offset</span> <span class="o">=</span> <span class="o">*</span><span class="n">offset</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">fprog</span> <span class="o">=</span> <span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">fprog</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">fp_size</span> <span class="o">=</span> <span class="n">fprog</span><span class="o">-></span><span class="n">len</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="nc">sock_filter</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">curr_offset</span> <span class="o">></span> <span class="n">fp_size</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">left</span> <span class="o">=</span> <span class="n">fp_size</span> <span class="o">-</span> <span class="n">curr_offset</span><span class="p">;</span>
<span class="n">copied</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"Servicing device_read: fprog=%p, fp_size=%lu, offset=%lu, left=%lu, len=%lu</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">fprog</span><span class="p">,</span> <span class="n">fp_size</span><span class="p">,</span> <span class="n">curr_offset</span><span class="p">,</span> <span class="n">left</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">copied</span> <span class="o">||</span> <span class="n">copy_to_user</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">)(</span><span class="n">fprog</span><span class="o">-></span><span class="n">insns</span><span class="p">)</span> <span class="o">+</span> <span class="n">curr_offset</span><span class="p">,</span> <span class="n">copied</span><span class="p">))</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">curr_offset</span> <span class="o">+=</span> <span class="n">copied</span><span class="p">;</span>
<span class="o">*</span><span class="n">offset</span> <span class="o">=</span> <span class="n">curr_offset</span><span class="p">;</span>
<span class="k">return</span> <span class="n">copied</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Called when a process tries to write to our device */</span>
<span class="k">static</span> <span class="kt">ssize_t</span> <span class="nf">device_write</span><span class="p">(</span><span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">flip</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buffer</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">offset</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* This is a read-only device */</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_ALERT</span> <span class="s">"This operation is not supported.</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Called when a process opens our device */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_open</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* If device is open, return busy */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">device_open_count</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">-</span><span class="n">EBUSY</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">device_open_count</span><span class="o">++</span><span class="p">;</span>
<span class="n">print_seccomp_icount</span><span class="p">();</span>
<span class="n">try_module_get</span><span class="p">(</span><span class="n">THIS_MODULE</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* Called when a process closes our device */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">device_release</span><span class="p">(</span><span class="k">struct</span> <span class="nc">inode</span> <span class="o">*</span><span class="n">inode</span><span class="p">,</span> <span class="k">struct</span> <span class="nc">file</span> <span class="o">*</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* Decrement the open counter and usage count. Without this, the module would not unload. */</span>
<span class="n">device_open_count</span><span class="o">--</span><span class="p">;</span>
<span class="n">module_put</span><span class="p">(</span><span class="n">THIS_MODULE</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">int</span> <span class="n">__init</span> <span class="nf">seccomp_idump_init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* Try to register character device */</span>
<span class="n">major_num</span> <span class="o">=</span> <span class="n">register_chrdev</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">DEVICE_NAME</span><span class="p">,</span> <span class="o">&</span><span class="n">file_ops</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">major_num</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_ALERT</span> <span class="s">"Could not register device: %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">major_num</span><span class="p">);</span>
<span class="k">return</span> <span class="n">major_num</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"seccomp_idump module loaded with device major number %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">major_num</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kt">void</span> <span class="n">__exit</span> <span class="nf">seccomp_idump_exit</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* Remember — we have to clean up after ourselves. Unregister the character device. */</span>
<span class="n">unregister_chrdev</span><span class="p">(</span><span class="n">major_num</span><span class="p">,</span> <span class="n">DEVICE_NAME</span><span class="p">);</span>
<span class="n">printk</span><span class="p">(</span><span class="n">KERN_INFO</span> <span class="s">"Unregistering seccomp_idump</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/* Register module functions */</span>
<span class="n">module_init</span><span class="p">(</span><span class="n">seccomp_idump_init</span><span class="p">);</span>
<span class="n">module_exit</span><span class="p">(</span><span class="n">seccomp_idump_exit</span><span class="p">);</span>
</code></pre></div></div>
<p>Now we can do the same magic as before:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span><span class="nb">cd </span>k310.idump <span class="o">&&</span> make <span class="o">&&</span> insmod seccomp_idump.ko
<span class="o">[</span>host]<span class="nv">$ </span>dmesg <span class="nt">-T</span>
seccomp_idump module loaded with device major number 237
<span class="o">[</span>host]<span class="nv">$ MAJOR</span><span class="o">=</span>237<span class="p">;</span> <span class="nb">sudo mknod</span> /dev/seccomp_idump c <span class="nv">$MAJOR</span> 0
<span class="o">[</span>host]<span class="nv">$ </span>docker run <span class="nt">--security-opt</span> seccomp:<span class="nv">$HOME</span>/sk_run_filter/seccomp.json <span class="nt">-v</span><span class="nv">$HOME</span>:<span class="nv">$HOME</span> <span class="nt">--device</span><span class="o">=</span>/dev/seccomp_idump:/dev/seccomp_idump <span class="nt">-it</span> <span class="nv">$euler_os_container</span> /bin/bash
<span class="o">[</span>container]<span class="nv">$ </span><span class="nb">cat</span> /dev/seccomp_idump <span class="o">></span> BPF.code
<span class="o">[</span>host]<span class="nv">$ </span>dmesg | <span class="nb">tail
</span>Servicing the process <span class="nv">pid</span><span class="o">=</span>32557, <span class="nv">seccomp_mode</span><span class="o">=</span>2, <span class="nv">seccomp_filter</span><span class="o">=</span>ffff9ada124c0000, <span class="nv">instructions</span><span class="o">=</span>959
Servicing device_read: <span class="nv">fprog</span><span class="o">=</span>ffff9ada124c0000, <span class="nv">fp_size</span><span class="o">=</span>7672, <span class="nv">offset</span><span class="o">=</span>0, <span class="nv">left</span><span class="o">=</span>7672, <span class="nv">len</span><span class="o">=</span>65536
Servicing device_read: <span class="nv">fprog</span><span class="o">=</span>ffff9ada124c0000, <span class="nv">fp_size</span><span class="o">=</span>7672, <span class="nv">offset</span><span class="o">=</span>7672, <span class="nv">left</span><span class="o">=</span>0, <span class="nv">len</span><span class="o">=</span>65536
</code></pre></div></div>
<p>Note: I used a different server/docker here, so the number of instructions changed (953 -> 959) even though the JSON file stayed the same.</p>
<p>Now we need to write a simple disassembler for the BPF code we just dumped. Using the sources of Linux kernel 3.10, we get something like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdio.h>
#include <stdint.h>
</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="p">{</span> <span class="cm">/* Filter block */</span>
<span class="kt">uint16_t</span> <span class="n">code</span><span class="p">;</span> <span class="cm">/* Actual filter code */</span>
<span class="kt">uint8_t</span> <span class="n">jt</span><span class="p">;</span> <span class="cm">/* Jump true */</span>
<span class="kt">uint8_t</span> <span class="n">jf</span><span class="p">;</span> <span class="cm">/* Jump false */</span>
<span class="kt">uint32_t</span> <span class="n">k</span><span class="p">;</span> <span class="cm">/* Generic multiuse field */</span>
<span class="p">};</span>
<span class="k">using</span> <span class="n">sock_filter_t</span> <span class="o">=</span> <span class="k">struct</span> <span class="nc">sock_filter</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="nf">disassemble_code</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint16_t</span> <span class="n">code</span><span class="p">)</span> <span class="p">{</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">code2str</span><span class="p">[]</span> <span class="p">{</span>
<span class="s">" #0 "</span><span class="p">,</span>
<span class="s">"BPF_S_RET_K "</span><span class="p">,</span>
<span class="s">"BPF_S_RET_A "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_ADD_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_ADD_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_SUB_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_SUB_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_MUL_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_MUL_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_DIV_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_MOD_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_MOD_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_AND_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_AND_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_OR_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_OR_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_XOR_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_XOR_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_LSH_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_LSH_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_RSH_K "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_RSH_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_NEG "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_W_ABS "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_H_ABS "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_B_ABS "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_W_LEN "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_W_IND "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_H_IND "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_B_IND "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_IMM "</span><span class="p">,</span>
<span class="s">"BPF_S_LDX_W_LEN "</span><span class="p">,</span>
<span class="s">"BPF_S_LDX_B_MSH "</span><span class="p">,</span>
<span class="s">"BPF_S_LDX_IMM "</span><span class="p">,</span>
<span class="s">"BPF_S_MISC_TAX "</span><span class="p">,</span>
<span class="s">"BPF_S_MISC_TXA "</span><span class="p">,</span>
<span class="s">"BPF_S_ALU_DIV_K "</span><span class="p">,</span>
<span class="s">"BPF_S_LD_MEM "</span><span class="p">,</span>
<span class="s">"BPF_S_LDX_MEM "</span><span class="p">,</span>
<span class="s">"BPF_S_ST "</span><span class="p">,</span>
<span class="s">"BPF_S_STX "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JA "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JEQ_K "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JEQ_X "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JGE_K "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JGE_X "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JGT_K "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JGT_X "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JSET_K "</span><span class="p">,</span>
<span class="s">"BPF_S_JMP_JSET_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_PROTOCOL "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_PKTTYPE "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_IFINDEX "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_NLATTR "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_NLATTR_NEST "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_MARK "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_QUEUE "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_HATYPE "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_RXHASH "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_CPU "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_ALU_XOR_X "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_SECCOMP_LD_W "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_VLAN_TAG "</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_VLAN_TAG_PRESENT"</span><span class="p">,</span>
<span class="s">"BPF_S_ANC_PAY_OFFSET "</span><span class="p">,</span>
<span class="p">};</span>
<span class="k">static</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">error</span><span class="o">=</span><span class="s">"???"</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">code</span> <span class="o">>=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">code2str</span><span class="p">)</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span><span class="p">))</span>
<span class="k">return</span> <span class="n">error</span><span class="p">;</span>
<span class="k">return</span> <span class="n">code2str</span><span class="p">[</span><span class="n">code</span><span class="p">];</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">disassemble</span><span class="p">(</span><span class="k">const</span> <span class="n">sock_filter_t</span> <span class="o">*</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%s 0x%04x jt=0x%02x jf=0x%02x k=0x%08x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">disassemble_code</span><span class="p">(</span><span class="n">f</span><span class="o">-></span><span class="n">code</span><span class="p">),</span> <span class="n">f</span><span class="o">-></span><span class="n">code</span><span class="p">,</span> <span class="n">f</span><span class="o">-></span><span class="n">jt</span><span class="p">,</span> <span class="n">f</span><span class="o">-></span><span class="n">jf</span><span class="p">,</span> <span class="n">f</span><span class="o">-></span><span class="n">k</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="cp">#define BUF_SIZE 512
</span><span class="n">sock_filter_t</span> <span class="n">buffer</span><span class="p">[</span><span class="n">BUF_SIZE</span><span class="p">];</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">FILE</span> <span class="o">*</span><span class="n">ifile</span> <span class="o">=</span> <span class="n">stdin</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">argc</span> <span class="o">>=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">ifilename</span> <span class="o">=</span> <span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">ifile</span> <span class="o">=</span> <span class="n">fopen</span><span class="p">(</span><span class="n">ifilename</span><span class="p">,</span> <span class="s">"r"</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">size_t</span> <span class="n">total_processed</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">items</span> <span class="o">=</span> <span class="n">fread</span><span class="p">(</span><span class="o">&</span><span class="n">buffer</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">sock_filter_t</span><span class="p">),</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">ifile</span><span class="p">))</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">items</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
<span class="n">disassemble</span><span class="p">(</span><span class="n">buffer</span> <span class="o">+</span> <span class="n">i</span><span class="p">);</span>
<span class="n">total_processed</span> <span class="o">+=</span> <span class="n">items</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ifile</span> <span class="o">!=</span> <span class="n">stdin</span><span class="p">)</span> <span class="n">fclose</span><span class="p">(</span><span class="n">ifile</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span>
<span class="s">"=======================================================</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"Processed instructions: %lu</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">total_processed</span>
<span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So, what is inside?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[host]$ ./dbpf BPF.code
BPF_S_ANC_SECCOMP_LD_W 0x003d jt=0x00 jf=0x00 k=0x00000004
BPF_S_JMP_JEQ_K 0x002a jt=0x01 jf=0x00 k=0xc000003e
BPF_S_JMP_JA 0x0029 jt=0x00 jf=0x00 k=0x00000285
BPF_S_ANC_SECCOMP_LD_W 0x003d jt=0x00 jf=0x00 k=0x00000000
BPF_S_JMP_JEQ_K 0x002a jt=0xb5 jf=0x00 k=0x00000000
BPF_S_JMP_JEQ_K 0x002a jt=0xb4 jf=0x00 k=0x00000001
BPF_S_JMP_JEQ_K 0x002a jt=0xb3 jf=0x00 k=0x00000002
BPF_S_JMP_JEQ_K 0x002a jt=0xb2 jf=0x00 k=0x00000003
BPF_S_JMP_JEQ_K 0x002a jt=0xb1 jf=0x00 k=0x00000004
BPF_S_JMP_JEQ_K 0x002a jt=0xb0 jf=0x00 k=0x00000005
BPF_S_JMP_JEQ_K 0x002a jt=0xaf jf=0x00 k=0x00000006
BPF_S_JMP_JEQ_K 0x002a jt=0xae jf=0x00 k=0x00000007
BPF_S_JMP_JEQ_K 0x002a jt=0xad jf=0x00 k=0x00000008
...
BPF_S_JMP_JEQ_K 0x002a jt=0x08 jf=0x00 k=0x00000174
BPF_S_JMP_JEQ_K 0x002a jt=0x07 jf=0x00 k=0x00000175
BPF_S_JMP_JEQ_K 0x002a jt=0x06 jf=0x00 k=0x00000179
BPF_S_JMP_JEQ_K 0x002a jt=0x00 jf=0x04 k=0x00000088
BPF_S_ANC_SECCOMP_LD_W 0x003d jt=0x00 jf=0x00 k=0x00000010
BPF_S_JMP_JEQ_K 0x002a jt=0x03 jf=0x00 k=0xffffffff
BPF_S_JMP_JEQ_K 0x002a jt=0x02 jf=0x00 k=0x00000008
BPF_S_JMP_JEQ_K 0x002a jt=0x01 jf=0x00 k=0x00000000
BPF_S_RET_K 0x0001 jt=0x00 jf=0x00 k=0x00050001
BPF_S_RET_K 0x0001 jt=0x00 jf=0x00 k=0x7fff0000
BPF_S_RET_K 0x0001 jt=0x00 jf=0x00 k=0x00000000
=======================================================
Processed instructions: 959
</code></pre></div></div>
<p>The pattern shown above repeats 3 times - once for each of the 3 architectures specified in the original JSON file (“SCMP_ARCH_X86_64”, “SCMP_ARCH_X86” and “SCMP_ARCH_X32”). That gives roughly 300 comparisons per architecture, 300 * 3 = 900 instructions in total - consistent with the number of system calls in Linux kernel 3.10 (313). Let’s try to read this assembly. The input for the BPF program is this structure:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* struct seccomp_data - the format the BPF program executes over.
* @nr: the system call number
* @arch: indicates system call convention as an AUDIT_ARCH_* value
* as defined in <linux/audit.h>.
* @instruction_pointer: at the time of the system call.
* @args: up to 6 system call arguments always stored as 64-bit values
* regardless of the architecture.
*/</span>
<span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">nr</span><span class="p">;</span>
<span class="n">__u32</span> <span class="n">arch</span><span class="p">;</span>
<span class="n">__u64</span> <span class="n">instruction_pointer</span><span class="p">;</span>
<span class="n">__u64</span> <span class="n">args</span><span class="p">[</span><span class="mi">6</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>
<p>So, the first <strong>BPF_S_ANC_SECCOMP_LD_W</strong> instruction reads the <strong>arch</strong> field (offset 4), and the following <strong>BPF_S_JMP_JEQ_K</strong> checks it against the value 0xc000003e (x86_64). If it matches, we jump +1 instruction from the current position and execute the second <strong>BPF_S_ANC_SECCOMP_LD_W</strong>, which reads the <strong>syscall number</strong> (offset 0, field <strong>nr</strong>). Then comes a long chain of <strong>BPF_S_JMP_JEQ_K</strong> instructions that compare <strong>that syscall number</strong> against the constants 0x1, 0x2, 0x3, 0x4 … and so on (field <strong>k</strong>). If a comparison succeeds, the jump is taken (the offset is stored in the <strong>jt</strong> field of the instruction); otherwise the next instruction in the chain is executed. Eventually, we get code like this:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A</span> <span class="o">=</span> <span class="n">seccomp_data</span><span class="p">.</span><span class="n">arch</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span> <span class="o">!=</span> <span class="n">x86_64</span><span class="p">)</span> <span class="k">goto</span> <span class="n">other_arch</span><span class="p">;</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">seccomp_data</span><span class="p">.</span><span class="n">nr</span> <span class="err">#</span> <span class="n">syscall_number</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="n">then</span> <span class="k">goto</span> <span class="n">allow</span><span class="o">-</span><span class="n">label</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="n">then</span> <span class="k">goto</span> <span class="n">allow</span><span class="o">-</span><span class="n">label</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span> <span class="o">==</span> <span class="mi">3</span><span class="p">)</span> <span class="n">then</span> <span class="k">goto</span> <span class="n">allow</span><span class="o">-</span><span class="n">label</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span> <span class="o">==</span> <span class="mi">4</span><span class="p">)</span> <span class="n">then</span> <span class="k">goto</span> <span class="n">allow</span><span class="o">-</span><span class="n">label</span><span class="p">;</span>
<span class="p">...</span>
<span class="k">if</span> <span class="p">(</span><span class="n">A</span> <span class="o">==</span> <span class="mi">300</span><span class="p">)</span> <span class="n">then</span> <span class="k">goto</span> <span class="n">allow</span><span class="o">-</span><span class="n">label</span><span class="p">;</span>
<span class="n">error</span><span class="o">-</span><span class="n">label</span><span class="o">:</span> <span class="k">return</span> <span class="n">error</span><span class="o">-</span><span class="n">code</span><span class="p">;</span>
<span class="n">allow</span><span class="o">-</span><span class="n">label</span><span class="o">:</span> <span class="k">return</span> <span class="n">allow</span><span class="o">-</span><span class="n">code</span><span class="p">;</span>
<span class="n">other_arch</span><span class="o">:</span> <span class="o"><</span><span class="n">repeat</span> <span class="n">the</span> <span class="n">code</span> <span class="n">pattern</span> <span class="n">again</span><span class="o">></span>
</code></pre></div></div>
<p>As you can see, this is an <strong>O(n)</strong> algorithm executed in the BPF interpreter: each virtual instruction expands into many x86_64 instructions inside that interpreter, so a lot of CPU time is wasted on every system call.</p>
<h1 id="up-to-date-libseccomp-library-advanced-mitigation-for-linux-kernel-310">Up-to-date libseccomp library (advanced mitigation for Linux kernel 3.10)</h1>
<p>Let’s create the same BPF program directly using <strong>libseccomp</strong>. I simply converted the original JSON file used above into C code using the <strong>libseccomp</strong> API and its <em>howto</em> examples.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <seccomp.h>
</span>
<span class="cp">#include <stdio.h>
#include <errno.h>
#include <unistd.h>
</span>
<span class="kt">int</span> <span class="n">syscalls</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">io_submit</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">io_getevents</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigaction</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">nanosleep</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sendto</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pread64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pwrite64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">wait4</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">read</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">write</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">close</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">stat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">stat64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mmap</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">munmap</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">open</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fstat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fstat64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lstat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">futex</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">brk</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">clone</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ioctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lseek</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getrusage</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getppid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">select</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">recvfrom</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigprocmask</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mprotect</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">socket</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">connect</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">set_robust_list</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">set_tid_address</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">madvise</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getpriority</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">io_setup</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">openat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getrlimit</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getdents</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">execve</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">access</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">arch_prctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">alarm</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">kill</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">unlink</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pipe</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">creat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigreturn</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fcntl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">geteuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">readlink</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">dup2</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">msync</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setsockopt</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rmdir</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">vfork</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getpid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">unlinkat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">uname</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">newfstatat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setrlimit</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">poll</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">umask</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getpgrp</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">recvmsg</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">chmod</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">bind</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">chdir</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">listen</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getcwd</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">faccessat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fadvise64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fadvise64_64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">accept</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getsockname</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getgroups</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">shmctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">shmdt</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">shmat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_getaffinity</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fsync</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">utimensat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">shmget</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">gettid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">clock_gettime</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">exit_group</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">socketpair</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">prctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setsid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">io_destroy</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setpriority</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getsid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">restart_syscall</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">accept4</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">capget</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">capset</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">clock_getres</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">clock_nanosleep</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">copy_file_range</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">dup</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">dup3</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_create</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_create1</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_ctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_ctl_old</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_pwait</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_wait</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">epoll_wait_old</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">eventfd</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">eventfd2</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">execveat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">exit</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fallocate</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fanotify_mark</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchdir</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchmod</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchmodat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fcntl64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fdatasync</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fgetxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">flistxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">flock</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fork</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fremovexattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fsetxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fstatat64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fstatfs</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fstatfs64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ftruncate</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ftruncate64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">futimesat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getcpu</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getdents64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getegid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getegid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">geteuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getgid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getgroups32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getitimer</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getpeername</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getpgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getrandom</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getresgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getresgid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getresuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getresuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">get_robust_list</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getsockopt</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">get_thread_area</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">gettimeofday</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">getxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">inotify_add_watch</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">inotify_init</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">inotify_init1</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">inotify_rm_watch</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">io_cancel</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ioprio_get</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ioprio_set</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ipc</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lgetxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">link</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">linkat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">listxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">llistxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">_llseek</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lremovexattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lsetxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lstat64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">memfd_create</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mincore</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mkdir</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mkdirat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mknod</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">memfd_create</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mincore</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mkdir</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mkdirat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mknod</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mknodat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mmap2</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mq_getsetattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mq_notify</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mq_open</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mq_timedreceive</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mq_timedsend</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mq_unlink</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mremap</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">msgctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">msgget</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">msgrcv</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">msgsnd</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">munlock</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">munlockall</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">_newselect</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pause</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pipe2</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ppoll</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">preadv</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">prlimit64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pselect6</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">pwritev</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">readahead</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">readlinkat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">readv</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">recv</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">recvmmsg</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">remap_file_pages</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">removexattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rename</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">renameat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">renameat2</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigpending</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigqueueinfo</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigsuspend</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_sigtimedwait</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">rt_tgsigqueueinfo</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_getattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_getparam</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_get_priority_max</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_get_priority_min</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_getscheduler</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_rr_get_interval</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_setaffinity</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_setattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_setparam</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_setscheduler</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sched_yield</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">seccomp</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">semctl</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">semget</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">semop</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">semtimedop</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">send</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sendfile</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sendfile64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sendmmsg</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sendmsg</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setfsgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setfsgid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setfsuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setfsuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setgid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setgroups</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setgroups32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setitimer</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setpgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setregid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setregid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setresgid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setresgid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setresuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setresuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setreuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setreuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">set_thread_area</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setuid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setuid32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setxattr</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">shutdown</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sigaltstack</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">signalfd</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">signalfd4</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sigreturn</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">socketcall</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">splice</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">statfs</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">statfs64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">symlink</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">symlinkat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sync</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sync_file_range</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">syncfs</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sysinfo</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">syslog</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">tee</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">tgkill</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">time</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timer_create</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timer_delete</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timerfd_create</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timerfd_gettime</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timerfd_settime</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timer_getoverrun</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timer_gettime</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">timer_settime</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">times</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">tkill</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">truncate</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">truncate64</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">ugetrlimit</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">utime</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">utimes</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">vmsplice</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">waitid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">waitpid</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">writev</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">modify_ldt</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">chown</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">chown32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchown</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchown32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchownat</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lchown</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lchown32</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">chroot</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">reboot</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">bpf</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fanotify_init</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">lookup_dcookie</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">mount</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">perf_event_open</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setdomainname</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">sethostname</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">setns</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">umount</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">umount2</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">unshare</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">fchown</span><span class="p">),</span>
<span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">reboot</span><span class="p">),</span>
<span class="p">};</span>
<span class="k">const</span> <span class="kt">size_t</span> <span class="n">syscalls_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">syscalls</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">syscalls</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">rc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">scmp_filter_ctx</span> <span class="n">ctx</span> <span class="o">=</span> <span class="n">seccomp_init</span><span class="p">(</span><span class="n">SCMP_ACT_ERRNO</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ctx</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">ENOMEM</span><span class="p">;</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span> <span class="p">}</span>
<span class="n">rc</span> <span class="o">=</span> <span class="n">seccomp_arch_remove</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ARCH_NATIVE</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="n">rc</span> <span class="o">=</span> <span class="n">seccomp_arch_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ARCH_X86_64</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="c1">// rc = seccomp_attr_set(ctx, SCMP_FLTATR_CTL_OPTIMIZE, 2);</span>
<span class="c1">// if (rc < 0) goto out;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">syscalls_size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">syscalls</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">personality</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">SCMP_A0</span><span class="p">(</span><span class="n">SCMP_CMP_EQ</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">personality</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">SCMP_A0</span><span class="p">(</span><span class="n">SCMP_CMP_EQ</span><span class="p">,</span> <span class="mi">8</span><span class="p">));</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">personality</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">SCMP_A0</span><span class="p">(</span><span class="n">SCMP_CMP_EQ</span><span class="p">,</span> <span class="mi">4294967295</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="n">seccomp_export_bpf</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">STDOUT_FILENO</span><span class="p">);</span>
<span class="nl">out:</span>
<span class="n">seccomp_release</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">rc</span> <span class="o">:</span> <span class="n">rc</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that I’ve commented out the binary tree optimization for now. Let’s check the BPF assembly:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span>gcc <span class="nt">-O3</span> <span class="nt">-g</span> genbpf.cc <span class="nt">-I</span> ./libseccomp/include ./libseccomp/lib/libseccomp.a <span class="nt">-o</span> genbpf
<span class="o">[</span>host]<span class="nv">$ </span>./genbpf | <span class="nv">$libseccomp_sources_path</span>/scmp_bpf_disasm
line OP JT JF K
<span class="o">=================================</span>
0000: 0x20 0x00 0x00 0x00000004 ld <span class="nv">$data</span><span class="o">[</span>4]
0001: 0x15 0x00 0x03 0xc000003e jeq 3221225534 <span class="nb">true</span>:0002 <span class="nb">false</span>:0005
0002: 0x20 0x00 0x00 0x00000000 ld <span class="nv">$data</span><span class="o">[</span>0]
0003: 0x35 0x00 0x02 0x40000000 jge 1073741824 <span class="nb">true</span>:0004 <span class="nb">false</span>:0006
0004: 0x15 0x01 0x00 0xffffffff jeq 4294967295 <span class="nb">true</span>:0006 <span class="nb">false</span>:0005
0005: 0x06 0x00 0x00 0x00000000 ret KILL
0006: 0x15 0x1b 0x00 0x00000000 jeq 0 <span class="nb">true</span>:0034 <span class="nb">false</span>:0007
0007: 0x15 0x1a 0x00 0x00000001 jeq 1 <span class="nb">true</span>:0034 <span class="nb">false</span>:0008
0008: 0x15 0x19 0x00 0x00000002 jeq 2 <span class="nb">true</span>:0034 <span class="nb">false</span>:0009
0009: 0x15 0x18 0x00 0x00000003 jeq 3 <span class="nb">true</span>:0034 <span class="nb">false</span>:0010
0010: 0x15 0x17 0x00 0x00000004 jeq 4 <span class="nb">true</span>:0034 <span class="nb">false</span>:0011
0011: 0x15 0x16 0x00 0x00000005 jeq 5 <span class="nb">true</span>:0034 <span class="nb">false</span>:0012
0012: 0x15 0x15 0x00 0x00000006 jeq 6 <span class="nb">true</span>:0034 <span class="nb">false</span>:0013
....
0032: 0x15 0x01 0x00 0x0000001a jeq 26 <span class="nb">true</span>:0034 <span class="nb">false</span>:0033
0033: 0x15 0x00 0x01 0x0000001b jeq 27 <span class="nb">true</span>:0034 <span class="nb">false</span>:0035
0034: 0x06 0x00 0x00 0x7fff0000 ret ALLOW
0035: 0x15 0xff 0x00 0x0000001c jeq 28 <span class="nb">true</span>:0291 <span class="nb">false</span>:0036
0036: 0x15 0xfe 0x00 0x0000001d jeq 29 <span class="nb">true</span>:0291 <span class="nb">false</span>:0037
....
0282: 0x15 0x08 0x00 0x00000146 jeq 326 <span class="nb">true</span>:0291 <span class="nb">false</span>:0283
0283: 0x15 0x00 0x06 0x00000087 jeq 135 <span class="nb">true</span>:0284 <span class="nb">false</span>:0290
0284: 0x20 0x00 0x00 0x00000014 ld <span class="nv">$data</span><span class="o">[</span>20]
0285: 0x15 0x00 0x04 0x00000000 jeq 0 <span class="nb">true</span>:0286 <span class="nb">false</span>:0290
0286: 0x20 0x00 0x00 0x00000010 ld <span class="nv">$data</span><span class="o">[</span>16]
0287: 0x15 0x03 0x00 0xffffffff jeq 4294967295 <span class="nb">true</span>:0291 <span class="nb">false</span>:0288
0288: 0x15 0x02 0x00 0x00000008 jeq 8 <span class="nb">true</span>:0291 <span class="nb">false</span>:0289
0289: 0x15 0x01 0x00 0x00000000 jeq 0 <span class="nb">true</span>:0291 <span class="nb">false</span>:0290
0290: 0x06 0x00 0x00 0x00050001 ret ERRNO<span class="o">(</span>1<span class="o">)</span>
0291: 0x06 0x00 0x00 0x7fff0000 ret ALLOW
0292: 0x06 0x00 0x00 0x00000000 ret KILL
</code></pre></div></div>
<p>It’s the same <strong>O(n)</strong> chain. The assembly looks slightly different because the Linux kernel rewrites the loaded program internally, mostly changing the operation codes. Also note that all jump targets here are absolute, not relative as they were in the kernel BPF version.</p>
<p>Now let’s enable the binary tree optimization (uncomment <strong>seccomp_attr_set(ctx, SCMP_FLTATR_CTL_OPTIMIZE, 2)</strong>, recompile, and disassemble again):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>line OP JT JF K
<span class="o">=================================</span>
0000: 0x20 0x00 0x00 0x00000004 ld <span class="nv">$data</span><span class="o">[</span>4]
0001: 0x15 0x00 0x03 0xc000003e jeq 3221225534 <span class="nb">true</span>:0002 <span class="nb">false</span>:0005
0002: 0x20 0x00 0x00 0x00000000 ld <span class="nv">$data</span><span class="o">[</span>0]
0003: 0x35 0x00 0x02 0x40000000 jge 1073741824 <span class="nb">true</span>:0004 <span class="nb">false</span>:0006
0004: 0x15 0x01 0x00 0xffffffff jeq 4294967295 <span class="nb">true</span>:0006 <span class="nb">false</span>:0005
0005: 0x06 0x00 0x00 0x00000000 ret KILL
0006: 0x20 0x00 0x00 0x00000000 ld <span class="nv">$data</span><span class="o">[</span>0]
0007: 0x25 0x01 0x00 0x00000014 jgt 20 <span class="nb">true</span>:0009 <span class="nb">false</span>:0008
0008: 0x05 0x00 0x00 0x00000147 jmp 0336
0009: 0x25 0x00 0xa1 0x0000009d jgt 157 <span class="nb">true</span>:0010 <span class="nb">false</span>:0171
0010: 0x25 0x00 0x4f 0x000000f5 jgt 245 <span class="nb">true</span>:0011 <span class="nb">false</span>:0090
0011: 0x25 0x00 0x27 0x0000011b jgt 283 <span class="nb">true</span>:0012 <span class="nb">false</span>:0051
0012: 0x25 0x00 0x13 0x0000012b jgt 299 <span class="nb">true</span>:0013 <span class="nb">false</span>:0032
0013: 0x25 0x00 0x09 0x0000013a jgt 314 <span class="nb">true</span>:0014 <span class="nb">false</span>:0023
0014: 0x25 0x00 0x04 0x0000013e jgt 318 <span class="nb">true</span>:0015 <span class="nb">false</span>:0019
0015: 0x15 0x5a 0x00 0x00000146 jeq 326 <span class="nb">true</span>:0106 <span class="nb">false</span>:0016
0016: 0x15 0x59 0x00 0x00000142 jeq 322 <span class="nb">true</span>:0106 <span class="nb">false</span>:0017
0017: 0x15 0x58 0x00 0x00000141 jeq 321 <span class="nb">true</span>:0106 <span class="nb">false</span>:0018
0018: 0x15 0x57 0x53 0x0000013f jeq 319 <span class="nb">true</span>:0106 <span class="nb">false</span>:0102
...
0101: 0x15 0x04 0x00 0x000000ea jeq 234 <span class="nb">true</span>:0106 <span class="nb">false</span>:0102
0102: 0x06 0x00 0x00 0x00050001 ret ERRNO<span class="o">(</span>1<span class="o">)</span>
0103: 0x25 0x00 0x05 0x000000e5 jgt 229 <span class="nb">true</span>:0104 <span class="nb">false</span>:0109
0104: 0x15 0x01 0x00 0x000000e9 jeq 233 <span class="nb">true</span>:0106 <span class="nb">false</span>:0105
0105: 0x15 0x00 0x01 0x000000e8 jeq 232 <span class="nb">true</span>:0106 <span class="nb">false</span>:0107
0106: 0x06 0x00 0x00 0x7fff0000 ret ALLOW
0107: 0x15 0xff 0x00 0x000000e7 jeq 231 <span class="nb">true</span>:0363 <span class="nb">false</span>:0108
...
0359: 0x15 0x03 0x00 0x00000002 jeq 2 <span class="nb">true</span>:0363 <span class="nb">false</span>:0360
0360: 0x15 0x02 0x01 0x00000001 jeq 1 <span class="nb">true</span>:0363 <span class="nb">false</span>:0362
0361: 0x15 0x01 0x00 0x00000000 jeq 0 <span class="nb">true</span>:0363 <span class="nb">false</span>:0362
0362: 0x06 0x00 0x00 0x00050001 ret ERRNO<span class="o">(</span>1<span class="o">)</span>
0363: 0x06 0x00 0x00 0x7fff0000 ret ALLOW
0364: 0x06 0x00 0x00 0x00000000 ret KILL
</code></pre></div></div>
<p>As you can see, the algorithm has changed from <strong>O(n)</strong> to <strong>O(log n)</strong>, where n is the number of system calls to test.</p>
<h1 id="how-to-use-the-custom-libseccomp-advanced-mitigation-for-linux-kernel-310">How to use the custom libseccomp (advanced mitigation for Linux kernel 3.10)</h1>
<p>The easiest way I have found so far is the <strong>/bin/env</strong> approach: set up the environment, then <strong>execve</strong> the child process. Let’s create a <strong>seccomp.bintree</strong> utility; the code above needs only slight modifications:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <seccomp.h>
</span>
<span class="cp">#include <stdio.h>
#include <errno.h>
#include <unistd.h>
</span>
<span class="kt">int</span> <span class="n">syscalls</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="c1">// .... //</span>
<span class="p">};</span>
<span class="k">const</span> <span class="kt">size_t</span> <span class="n">syscalls_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">syscalls</span><span class="p">)</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">syscalls</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="kt">int</span> <span class="nf">apply_seccomp</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">rc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">scmp_filter_ctx</span> <span class="n">ctx</span> <span class="o">=</span> <span class="n">seccomp_init</span><span class="p">(</span><span class="n">SCMP_ACT_ERRNO</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ctx</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">ENOMEM</span><span class="p">;</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span> <span class="p">}</span>
<span class="n">rc</span> <span class="o">=</span> <span class="n">seccomp_arch_remove</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ARCH_NATIVE</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="n">rc</span> <span class="o">=</span> <span class="n">seccomp_arch_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ARCH_X86_64</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="n">rc</span> <span class="o">=</span> <span class="n">seccomp_attr_set</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_FLTATR_CTL_OPTIMIZE</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">syscalls_size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">syscalls</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">personality</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">SCMP_A0</span><span class="p">(</span><span class="n">SCMP_CMP_EQ</span><span class="p">,</span> <span class="mi">0</span><span class="p">));</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">personality</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">SCMP_A0</span><span class="p">(</span><span class="n">SCMP_CMP_EQ</span><span class="p">,</span> <span class="mi">8</span><span class="p">));</span>
<span class="n">rc</span> <span class="o">|=</span> <span class="n">seccomp_rule_add</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">SCMP_ACT_ALLOW</span><span class="p">,</span> <span class="n">SCMP_SYS</span><span class="p">(</span><span class="n">personality</span><span class="p">),</span> <span class="mi">1</span><span class="p">,</span> <span class="n">SCMP_A0</span><span class="p">(</span><span class="n">SCMP_CMP_EQ</span><span class="p">,</span> <span class="mi">4294967295</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="n">rc</span> <span class="o">=</span> <span class="n">seccomp_load</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">perror</span><span class="p">(</span><span class="s">"seccomp_load failed"</span><span class="p">);</span>
<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"SECCOMP APPLIED</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="nl">out:</span>
<span class="n">seccomp_release</span><span class="p">(</span><span class="n">ctx</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">?</span> <span class="o">-</span><span class="n">rc</span> <span class="o">:</span> <span class="n">rc</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">env</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">argc</span> <span class="o"><</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Specify child process"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">apply_seccomp</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rc</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"apply_seccomp() failed</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="n">rc</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">execve</span><span class="p">(</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">argv</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">env</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now let’s run the benchmark from the <strong>Reproducing the problem</strong> section under <strong>seccomp.bintree</strong> (the code above) and <strong>seccomp.default</strong> (the same code without the binary tree optimization), inside a Docker container with its built-in security restrictions disabled, since we apply seccomp manually ourselves:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>host]<span class="nv">$ </span>docker run <span class="nt">--security-opt</span> <span class="nv">seccomp</span><span class="o">=</span>unconfined <span class="nt">-v</span><span class="nv">$HOME</span>:<span class="nv">$HOME</span> <span class="nt">-it</span> <span class="nv">$euler_os_container</span> /bin/bash
<span class="o">[</span>container]<span class="nv">$ </span>./seccomp.default ./inf
SECCOMP APPLIED
<span class="o">[</span>container]<span class="nv">$ </span>./seccomp.bintree ./inf
SECCOMP APPLIED
</code></pre></div></div>
<p>Then we run <em>perf record</em>/<em>perf report</em> as usual.</p>
<p><strong>seccomp.default:</strong></p>
<p><img src="/assets/images/skrunfilter/libseccomp-seccomp-default.png" alt="libseccomp-seccomp-default" /></p>
<p><strong>seccomp.bintree:</strong></p>
<p><img src="/assets/images/skrunfilter/libseccomp-seccomp-bintree.png" alt="libseccomp-seccomp-bintree" /></p>
<p><strong>sk_run_filter</strong> CPU consumption in Linux kernel 3.10 dropped from <strong>50%</strong> to <strong>7%</strong>, which is a huge improvement!</p>
<h1 id="a-few-words-about-linux-kernel-515">A few words about Linux kernel 5.15</h1>
<p>The <strong>secure computing</strong> feature is further optimized in Linux kernel 5.15 with a bitmap cache. The optimization targets the whitelist approach, and the idea is very simple. When the BPF code is loaded into the kernel (via the <strong>prctl</strong> or <strong>seccomp</strong> system calls), a bitmap with one bit per system call is allocated. If the BPF code for a particular system call always returns <strong>SECCOMP_RET_ALLOW</strong> regardless of its arguments, the corresponding bit is set in the cache. Afterwards, for such system calls the JIT-compiled BPF program isn’t executed at all: the “allow” result is returned immediately.</p>
<p>In my opinion, this final patch solves the original issue completely. As usual, the proof comes from the Linux kernel sources:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* seccomp_run_filters - evaluates all seccomp filters against @sd
* @sd: optional seccomp data to be passed to filters
* @match: stores struct seccomp_filter that resulted in the return value,
* unless filter returned SECCOMP_RET_ALLOW, in which case it will
* be unchanged.
*
* Returns valid seccomp BPF response codes.
*/</span>
<span class="cp">#define ACTION_ONLY(ret) ((s32)((ret) & (SECCOMP_RET_ACTION_FULL)))
</span><span class="k">static</span> <span class="n">u32</span> <span class="nf">seccomp_run_filters</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="o">*</span><span class="n">sd</span><span class="p">,</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">**</span><span class="n">match</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">u32</span> <span class="n">ret</span> <span class="o">=</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">;</span>
<span class="cm">/* Make sure cross-thread synced filter points somewhere sane. */</span>
<span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">f</span> <span class="o">=</span>
<span class="n">READ_ONCE</span><span class="p">(</span><span class="n">current</span><span class="o">-></span><span class="n">seccomp</span><span class="p">.</span><span class="n">filter</span><span class="p">);</span>
<span class="cm">/* Ensure unexpected behavior doesn't result in failing open. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON</span><span class="p">(</span><span class="n">f</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">))</span>
<span class="k">return</span> <span class="n">SECCOMP_RET_KILL_PROCESS</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">seccomp_cache_check_allow</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">sd</span><span class="p">))</span> <span class="c1">// << -- HERE</span>
<span class="k">return</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">;</span>
<span class="cm">/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">f</span><span class="p">;</span> <span class="n">f</span> <span class="o">=</span> <span class="n">f</span><span class="o">-></span><span class="n">prev</span><span class="p">)</span> <span class="p">{</span>
<span class="n">u32</span> <span class="n">cur_ret</span> <span class="o">=</span> <span class="n">bpf_prog_run_pin_on_cpu</span><span class="p">(</span><span class="n">f</span><span class="o">-></span><span class="n">prog</span><span class="p">,</span> <span class="n">sd</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">ACTION_ONLY</span><span class="p">(</span><span class="n">cur_ret</span><span class="p">)</span> <span class="o"><</span> <span class="n">ACTION_ONLY</span><span class="p">(</span><span class="n">ret</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ret</span> <span class="o">=</span> <span class="n">cur_ret</span><span class="p">;</span>
<span class="o">*</span><span class="n">match</span> <span class="o">=</span> <span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
<span class="cm">/* .... */</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">bool</span> <span class="nf">seccomp_cache_check_allow_bitmap</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">bitmap</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">bitmap_size</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">syscall_nr</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">syscall_nr</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">syscall_nr</span> <span class="o">>=</span> <span class="n">bitmap_size</span><span class="p">))</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="n">syscall_nr</span> <span class="o">=</span> <span class="n">array_index_nospec</span><span class="p">(</span><span class="n">syscall_nr</span><span class="p">,</span> <span class="n">bitmap_size</span><span class="p">);</span>
<span class="k">return</span> <span class="n">test_bit</span><span class="p">(</span><span class="n">syscall_nr</span><span class="p">,</span> <span class="n">bitmap</span><span class="p">);</span>
<span class="p">}</span>
<span class="cm">/**
* seccomp_cache_check_allow - lookup seccomp cache
* @sfilter: The seccomp filter
* @sd: The seccomp data to lookup the cache with
*
* Returns true if the seccomp_data is cached and allowed.
*/</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">bool</span> <span class="nf">seccomp_cache_check_allow</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">sfilter</span><span class="p">,</span>
<span class="k">const</span> <span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="o">*</span><span class="n">sd</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">syscall_nr</span> <span class="o">=</span> <span class="n">sd</span><span class="o">-></span><span class="n">nr</span><span class="p">;</span>
<span class="k">const</span> <span class="k">struct</span> <span class="nc">action_cache</span> <span class="o">*</span><span class="n">cache</span> <span class="o">=</span> <span class="o">&</span><span class="n">sfilter</span><span class="o">-></span><span class="n">cache</span><span class="p">;</span>
<span class="cp">#ifndef SECCOMP_ARCH_COMPAT
</span> <span class="cm">/* A native-only architecture doesn't need to check sd->arch. */</span>
<span class="k">return</span> <span class="n">seccomp_cache_check_allow_bitmap</span><span class="p">(</span><span class="n">cache</span><span class="o">-></span><span class="n">allow_native</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_NATIVE_NR</span><span class="p">,</span>
<span class="n">syscall_nr</span><span class="p">);</span>
<span class="cp">#else
</span> <span class="k">if</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">sd</span><span class="o">-></span><span class="n">arch</span> <span class="o">==</span> <span class="n">SECCOMP_ARCH_NATIVE</span><span class="p">))</span>
<span class="k">return</span> <span class="n">seccomp_cache_check_allow_bitmap</span><span class="p">(</span><span class="n">cache</span><span class="o">-></span><span class="n">allow_native</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_NATIVE_NR</span><span class="p">,</span>
<span class="n">syscall_nr</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">sd</span><span class="o">-></span><span class="n">arch</span> <span class="o">==</span> <span class="n">SECCOMP_ARCH_COMPAT</span><span class="p">))</span>
<span class="k">return</span> <span class="n">seccomp_cache_check_allow_bitmap</span><span class="p">(</span><span class="n">cache</span><span class="o">-></span><span class="n">allow_compat</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_COMPAT_NR</span><span class="p">,</span>
<span class="n">syscall_nr</span><span class="p">);</span>
<span class="cp">#endif </span><span class="cm">/* SECCOMP_ARCH_COMPAT */</span><span class="cp">
</span>
<span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The bitmap cache is prepared in these three functions:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
* seccomp_cache_prepare - emulate the filter to find cacheable syscalls
* @sfilter: The seccomp filter
*
* Returns 0 if successful or -errno if error occurred.
*/</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">seccomp_cache_prepare</span><span class="p">(</span><span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">sfilter</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="nc">action_cache</span> <span class="o">*</span><span class="n">cache</span> <span class="o">=</span> <span class="o">&</span><span class="n">sfilter</span><span class="o">-></span><span class="n">cache</span><span class="p">;</span>
<span class="k">const</span> <span class="k">struct</span> <span class="nc">action_cache</span> <span class="o">*</span><span class="n">cache_prev</span> <span class="o">=</span>
<span class="n">sfilter</span><span class="o">-></span><span class="n">prev</span> <span class="o">?</span> <span class="o">&</span><span class="n">sfilter</span><span class="o">-></span><span class="n">prev</span><span class="o">-></span><span class="n">cache</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="n">seccomp_cache_prepare_bitmap</span><span class="p">(</span><span class="n">sfilter</span><span class="p">,</span> <span class="n">cache</span><span class="o">-></span><span class="n">allow_native</span><span class="p">,</span>
<span class="n">cache_prev</span> <span class="o">?</span> <span class="n">cache_prev</span><span class="o">-></span><span class="n">allow_native</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_NATIVE_NR</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_NATIVE</span><span class="p">);</span>
<span class="cp">#ifdef SECCOMP_ARCH_COMPAT
</span> <span class="n">seccomp_cache_prepare_bitmap</span><span class="p">(</span><span class="n">sfilter</span><span class="p">,</span> <span class="n">cache</span><span class="o">-></span><span class="n">allow_compat</span><span class="p">,</span>
<span class="n">cache_prev</span> <span class="o">?</span> <span class="n">cache_prev</span><span class="o">-></span><span class="n">allow_compat</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_COMPAT_NR</span><span class="p">,</span>
<span class="n">SECCOMP_ARCH_COMPAT</span><span class="p">);</span>
<span class="cp">#endif </span><span class="cm">/* SECCOMP_ARCH_COMPAT */</span><span class="cp">
</span><span class="p">}</span>
<span class="cm">/* ...... */</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">seccomp_cache_prepare_bitmap</span><span class="p">(</span><span class="k">struct</span> <span class="nc">seccomp_filter</span> <span class="o">*</span><span class="n">sfilter</span><span class="p">,</span>
<span class="kt">void</span> <span class="o">*</span><span class="n">bitmap</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">bitmap_prev</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">bitmap_size</span><span class="p">,</span> <span class="kt">int</span> <span class="n">arch</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">struct</span> <span class="nc">sock_fprog_kern</span> <span class="o">*</span><span class="n">fprog</span> <span class="o">=</span> <span class="n">sfilter</span><span class="o">-></span><span class="n">prog</span><span class="o">-></span><span class="n">orig_prog</span><span class="p">;</span>
<span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="n">sd</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">nr</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">bitmap_prev</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* The new filter must be as restrictive as the last. */</span>
<span class="n">bitmap_copy</span><span class="p">(</span><span class="n">bitmap</span><span class="p">,</span> <span class="n">bitmap_prev</span><span class="p">,</span> <span class="n">bitmap_size</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="cm">/* Before any filters, all syscalls are always allowed. */</span>
<span class="n">bitmap_fill</span><span class="p">(</span><span class="n">bitmap</span><span class="p">,</span> <span class="n">bitmap_size</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="n">nr</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">nr</span> <span class="o"><</span> <span class="n">bitmap_size</span><span class="p">;</span> <span class="n">nr</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* No bitmap change: not a cacheable action. */</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">test_bit</span><span class="p">(</span><span class="n">nr</span><span class="p">,</span> <span class="n">bitmap</span><span class="p">))</span>
<span class="k">continue</span><span class="p">;</span>
<span class="n">sd</span><span class="p">.</span><span class="n">nr</span> <span class="o">=</span> <span class="n">nr</span><span class="p">;</span>
<span class="n">sd</span><span class="p">.</span><span class="n">arch</span> <span class="o">=</span> <span class="n">arch</span><span class="p">;</span>
<span class="cm">/* No bitmap change: continue to always allow. */</span>
<span class="k">if</span> <span class="p">(</span><span class="n">seccomp_is_const_allow</span><span class="p">(</span><span class="n">fprog</span><span class="p">,</span> <span class="o">&</span><span class="n">sd</span><span class="p">))</span> <span class="c1">// <<----- HERE</span>
<span class="k">continue</span><span class="p">;</span>
<span class="cm">/*
* Not a cacheable action: always run filters.
* atomic clear_bit() not needed, filter not visible yet.
*/</span>
<span class="n">__clear_bit</span><span class="p">(</span><span class="n">nr</span><span class="p">,</span> <span class="n">bitmap</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="cm">/* ...... */</span>
<span class="cm">/**
* seccomp_is_const_allow - check if filter is constant allow with given data
* @fprog: The BPF programs
* @sd: The seccomp data to check against, only syscall number and arch
* number are considered constant.
*/</span>
<span class="k">static</span> <span class="kt">bool</span> <span class="nf">seccomp_is_const_allow</span><span class="p">(</span><span class="k">struct</span> <span class="nc">sock_fprog_kern</span> <span class="o">*</span><span class="n">fprog</span><span class="p">,</span>
<span class="k">struct</span> <span class="nc">seccomp_data</span> <span class="o">*</span><span class="n">sd</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">reg_value</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">pc</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">op_res</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="o">!</span><span class="n">fprog</span><span class="p">))</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">pc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">pc</span> <span class="o"><</span> <span class="n">fprog</span><span class="o">-></span><span class="n">len</span><span class="p">;</span> <span class="n">pc</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="nc">sock_filter</span> <span class="o">*</span><span class="n">insn</span> <span class="o">=</span> <span class="o">&</span><span class="n">fprog</span><span class="o">-></span><span class="n">filter</span><span class="p">[</span><span class="n">pc</span><span class="p">];</span>
<span class="n">u16</span> <span class="n">code</span> <span class="o">=</span> <span class="n">insn</span><span class="o">-></span><span class="n">code</span><span class="p">;</span>
<span class="n">u32</span> <span class="n">k</span> <span class="o">=</span> <span class="n">insn</span><span class="o">-></span><span class="n">k</span><span class="p">;</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">code</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">BPF_LD</span> <span class="o">|</span> <span class="n">BPF_W</span> <span class="o">|</span> <span class="n">BPF_ABS</span><span class="p">:</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">k</span><span class="p">)</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="nc">seccomp_data</span><span class="p">,</span> <span class="n">nr</span><span class="p">):</span>
<span class="n">reg_value</span> <span class="o">=</span> <span class="n">sd</span><span class="o">-></span><span class="n">nr</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="nc">seccomp_data</span><span class="p">,</span> <span class="n">arch</span><span class="p">):</span>
<span class="n">reg_value</span> <span class="o">=</span> <span class="n">sd</span><span class="o">-></span><span class="n">arch</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="nl">default:</span>
<span class="cm">/* can't optimize (non-constant value load) */</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_RET</span> <span class="o">|</span> <span class="n">BPF_K</span><span class="p">:</span>
<span class="cm">/* reached return with constant values only, check allow */</span>
<span class="k">return</span> <span class="n">k</span> <span class="o">==</span> <span class="n">SECCOMP_RET_ALLOW</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_JMP</span> <span class="o">|</span> <span class="n">BPF_JA</span><span class="p">:</span>
<span class="n">pc</span> <span class="o">+=</span> <span class="n">insn</span><span class="o">-></span><span class="n">k</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_JMP</span> <span class="o">|</span> <span class="n">BPF_JEQ</span> <span class="o">|</span> <span class="n">BPF_K</span><span class="p">:</span>
<span class="k">case</span> <span class="n">BPF_JMP</span> <span class="o">|</span> <span class="n">BPF_JGE</span> <span class="o">|</span> <span class="n">BPF_K</span><span class="p">:</span>
<span class="k">case</span> <span class="n">BPF_JMP</span> <span class="o">|</span> <span class="n">BPF_JGT</span> <span class="o">|</span> <span class="n">BPF_K</span><span class="p">:</span>
<span class="k">case</span> <span class="n">BPF_JMP</span> <span class="o">|</span> <span class="n">BPF_JSET</span> <span class="o">|</span> <span class="n">BPF_K</span><span class="p">:</span>
<span class="k">switch</span> <span class="p">(</span><span class="n">BPF_OP</span><span class="p">(</span><span class="n">code</span><span class="p">))</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">BPF_JEQ</span><span class="p">:</span>
<span class="n">op_res</span> <span class="o">=</span> <span class="n">reg_value</span> <span class="o">==</span> <span class="n">k</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_JGE</span><span class="p">:</span>
<span class="n">op_res</span> <span class="o">=</span> <span class="n">reg_value</span> <span class="o">>=</span> <span class="n">k</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_JGT</span><span class="p">:</span>
<span class="n">op_res</span> <span class="o">=</span> <span class="n">reg_value</span> <span class="o">></span> <span class="n">k</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_JSET</span><span class="p">:</span>
<span class="n">op_res</span> <span class="o">=</span> <span class="o">!!</span><span class="p">(</span><span class="n">reg_value</span> <span class="o">&</span> <span class="n">k</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="nl">default:</span>
<span class="cm">/* can't optimize (unknown jump) */</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">pc</span> <span class="o">+=</span> <span class="n">op_res</span> <span class="o">?</span> <span class="n">insn</span><span class="o">-></span><span class="n">jt</span> <span class="o">:</span> <span class="n">insn</span><span class="o">-></span><span class="n">jf</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="k">case</span> <span class="n">BPF_ALU</span> <span class="o">|</span> <span class="n">BPF_AND</span> <span class="o">|</span> <span class="n">BPF_K</span><span class="p">:</span>
<span class="n">reg_value</span> <span class="o">&=</span> <span class="n">k</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="nl">default:</span>
<span class="cm">/* can't optimize (unknown insn) */</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="cm">/* ran off the end of the filter?! */</span>
<span class="n">WARN_ON</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="k">return</span> <span class="nb">false</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If you are still with me here and actually reading these lines, my goal is achieved: now you know everything about the secure computing feature in Linux :)</p>
<h1 id="references">References</h1>
<ol>
<li>https://kernel.org/</li>
<li>https://elixir.bootlin.com/</li>
<li>https://developer.huaweicloud.com/ict/en/site-euleros/euleros</li>
<li>https://gist.github.com/fntlnz/08ae20befb91befd9a53cd91cdc6d507</li>
<li>https://docs.docker.com/engine/security/seccomp/</li>
<li>https://blog.sourcerer.io/writing-a-simple-linux-kernel-module-d9dc3762c234</li>
</ol>Dmitriy PhilimonovAn investigation story, where optimizing MySQL performance reveals the most hidden corners of Linux kerneliibench (aka the Index Insertion Benchmark) implemented as a sysbench workload2022-01-19T09:00:00+03:002022-01-19T09:00:00+03:00https://mysqlperf.github.io/mysql/sysbench-iibench<p>Our team has reimplemented iibench as a sysbench workload. You can read more about it on the project Github page.</p>MySQL Performance BlogOur team has reimplemented iibench as a sysbench workload. You can read more about it on the project Github page.Pedal to the metal or what else can speedup your CPU-bound application?2021-12-29T15:00:00+03:002021-12-29T15:00:00+03:00https://mysqlperf.github.io/mysql/elfremapper<h1 id="tldr">TL;DR</h1>
<ul>
<li>Moving code and data sections to huge pages increases application performance without any source code modification. We were able to get +10%.</li>
<li>It’s possible to quickly estimate the effect for your own project <em>without</em> any recompilation at all, details are <a href="https://github.com/dmitriy-philimonov/elfremapper">here</a>.</li>
<li>The final solution utilizes “classic” huge pages (<strong>not</strong> transparent huge pages), which is why it could be referred to as a next generation of <code class="language-plaintext highlighter-rouge">libhugetlbfs</code>.</li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>If you ask an engineer how to solve your performance issue, the answer will depend on the engineer’s specialization.</p>
<ul>
<li>A system architect opens the product documentation trying to find the bottleneck component. Replacing it should breathe new life into the whole system.</li>
<li>An SDE immediately asks for access to the source code, then apparently gets lost for the next couple of months analyzing algorithmic complexity - maybe someone missed a suboptimal “quadratic” piece of code, or even worse?</li>
<li>An SRE starts with profiling the core system processes, analyzing how they communicate with the OS kernel and how memory is used: <code class="language-plaintext highlighter-rouge">perf top</code> / <code class="language-plaintext highlighter-rouge">perf stat</code> / <code class="language-plaintext highlighter-rouge">perf record</code> / <code class="language-plaintext highlighter-rouge">perf report</code> / <code class="language-plaintext highlighter-rouge">jemalloc</code> profile or <code class="language-plaintext highlighter-rouge">pidstat</code> / <code class="language-plaintext highlighter-rouge">vmstat</code> / <code class="language-plaintext highlighter-rouge">sar</code> or <code class="language-plaintext highlighter-rouge">strace</code> / <code class="language-plaintext highlighter-rouge">gdb</code>. If a more or less up-to-date Linux kernel is at hand, eBPF helps a lot. The result: a list of the heaviest functions and what troubles the OS most (lack of network / disk bandwidth, or the amount of RAM?).</li>
<li>A compiler developer opens the brave new world of profile-guided binary code generation: PGO / AutoFDO / BOLT. He definitely offers LTO to strengthen the effect. It has been proven many times that applications become much faster, especially on non-x86 platforms. Recently all these technologies have been showing outstanding results while working without any source code modification. Attractive, isn’t it?</li>
<li>A hardware specialist opens up the doors of NUMA-aware architectures. Let’s be honest: we have been using NUMA servers for years, yet we still hold a great faith that all CPUs are equal and all RAM has the same access speed. By the way, “Random Access Memory” is a relic term from the previous century; today it’s just an illusion. What you really have is the set of L1/L2/L3 caches + the RAM which belongs to a particular NUMA node - the further away the memory is, the slower the access and the more complex the hardware synchronization. Forget about the gigabytes of RAM installed in your server: if you need real performance, imagine that all your memory is extremely simple, predictable, sequentially accessed, exclusively owned by the executing thread, and extraordinarily small (a couple of megabytes?). It’s really tough to apply all this knowledge to a particular project, but it’s definitely worth trying.</li>
<li>An OS developer, who bears the terrible burden of backward compatibility, certainly tells you stories about petabytes of production-ready applications, then opens your eyes to amazing new APIs for asynchronous NVMe access (<code class="language-plaintext highlighter-rouge">libaio</code>, <code class="language-plaintext highlighter-rouge">io-uring</code>), task schedulers for clouds (Linux kernel >= 4) and technologies for optimizing virtual / physical address translation.</li>
</ul>
<p>The range of available tools and technologies is quite big; today I’ll tell you about our experience of applying huge pages, using the MySQL server as an example. Here we improve CPU utilization by optimizing virtual-to-physical address translation.</p>
<p>There’ll be no stories about the OS virtual address subsystem and how it’s implemented inside the Linux kernel, or about what the MMU and TLB are. There are a lot of official articles all over the Internet and excellent books where all the theory / practical approaches are described in detail. If you’ve forgotten any of it, refresh your knowledge with your favorite book about modern operating systems.</p>
<p>Of course, the huge pages technology isn’t new. How many years have gone by since the Linux 2.6.16 release? However, the number of products using it is vanishingly small. For example, in the MySQL server huge pages can be used for the InnoDB buffer pool (the internal B-tree cache), where it’s implemented on top of the old SystemV shared memory API, which requires additional OS-specific configuration.</p>
<p>Ok, even employing old APIs is good, but where are the applications that seize the opportunity to exploit huge pages for their code and data segments? E.g. <code class="language-plaintext highlighter-rouge">.text</code>/<code class="language-plaintext highlighter-rouge">.data</code>/<code class="language-plaintext highlighter-rouge">.bss</code> are located in the standard process address space, which could be mapped to huge pages too. If an application has large <code class="language-plaintext highlighter-rouge">.text</code>/<code class="language-plaintext highlighter-rouge">.data</code>/<code class="language-plaintext highlighter-rouge">.bss</code> segments, access to them suffers significantly from iTLB/dTLB misses. I think the number of vendors who really use such an approach could be counted on the fingers of one hand. The relevant code examples I’ve found so far:</p>
<ul>
<li><a href="https://github.com/libhugetlbfs/libhugetlbfs/blob/master/elflink.c">libhugetlbfs</a>: the <code class="language-plaintext highlighter-rouge">remap_segments()</code> function</li>
<li><a href="https://chromium.googlesource.com/chromium/src/+/refs/heads/master/chromeos/hugepage_text/hugepage_text.cc">Google Chromium</a>: the <code class="language-plaintext highlighter-rouge">RemapHugetlbText*()</code> functions</li>
<li><a href="https://github.com/facebook/hhvm/blob/master/hphp/runtime/base/program-functions.cpp">Facebook HHVM</a>: the <code class="language-plaintext highlighter-rouge">HugifyText</code> function</li>
<li><a href="https://github.com/intel/iodlr/blob/master/large_page-c/large_page.c">Intel Optimizations for Dynamic Language Runtimes</a>: the <code class="language-plaintext highlighter-rouge">MoveRegionToLargePages</code> function</li>
</ul>
<p>Nevertheless, all related published papers reach the same conclusion: applications become faster if code and data are moved to mappings backed by huge pages. That’s why our team decided to conduct our own research in this field.</p>
<p>The theory here is pretty simple: the larger the page, the bigger the address space that can be covered by the TLB. As soon as the number of frequently used pages exceeds the number of TLB entries, performance drops dramatically. By the way, modern CPUs have several TLBs, usually at the L1 and L2 levels. <a href="https://medium.com/applied/applied-c-memory-latency-d05a42fe354e">This article</a> describes a benchmark that shows the performance impact of L1/L2 TLB misses for a specific CPU. Moreover, different CPU architectures support different sets of huge page sizes. E.g. x86_64: 2M, 1G; ppc64: 16M; AArch64: 64K, 2M, 512M, 16G (depending on the CPU model and OS kernel configuration). That’s why the choice of page size is determined by the particular application and the problem it solves. For MySQL server 8.0 the code and data segments total about 120 MB (not too big). For our goals, only x86_64 and AArch64 matter, therefore we picked the default 2 MB huge pages.</p>
<p>The next question is what huge page technology to choose? The Linux OS offers:</p>
<ul>
<li>classic huge pages</li>
<li>transparent huge pages</li>
</ul>
<p>Good old Morpheus comes to mind here</p>
<p><img src="/assets/images/elfremapper/morpheus.png" alt="morpheus" /></p>
<ul>
<li>The blue pill (transparent huge pages) - you turn the technology on in the kernel, prepare a correctly aligned memory mapping and recommend that the Linux kernel use it. That’s all. Afterwards you “wake up in your bed and believe that everything else was just a dream”.</li>
<li>The red pill (classic huge pages) - you dig further and figure out “how deep the rabbit hole goes”.</li>
</ul>
<p>The easiest way is to take the blue pill. However, we were seriously concerned by <a href="https://www.percona.com/blog/2019/03/06/settling-the-myth-of-transparent-hugepages-for-databases/">the Percona experience</a> with THP usage for generic databases. So, the pitfalls:</p>
<ul>
<li>Physical memory defragmentation. Have you ever noticed the “khugepaged” process? It could suddenly stall your application even if you never planned to use any transparent huge pages at all: it relocates pages all over the system during defragmentation. Even a major huge page consumer (like the MySQL server) endures sporadic spikes in TPS/latency during that process.</li>
<li>Unpredictable behaviour. What does every DBA hope for? That’s right: the technology stack must be lightweight and simple, the system must be predictable and fast. THP is a kernel optimization. It might provide a performance boost, or it might not work at all, or it could work temporarily (in some cases), or it could work all the time but with some limitations - and only the specific kernel version knows what’s going on. High-quality performance estimation is a hard job on its own; estimating the performance of a kernel optimization is much, much harder. Of course, if you are a Linux kernel developer, it’s not a problem, but in that case I doubt you’re reading my article :)</li>
<li>Swapping. Older Linux kernels split a huge page into default-size pages before dumping it to disk. When the huge page is loaded back, the bunch of default-size pages is merged back into one huge page. Obviously, this process hurts performance badly. Classic huge pages are allocated in RAM permanently; they never go to <code class="language-plaintext highlighter-rouge">swap</code>. At the time of writing, I’ve seen Linux kernel patches which solve this problem.</li>
<li>Memory consumption growth. Even if only a couple of kilobytes is needed, we still allocate a full huge page (e.g. 2M). I agree that this is a common symptom of both classic and transparent huge pages, but the programmer doesn’t have any control over THP at all. All publications keep mentioning this issue regularly, so I decided to follow the tradition and add it too.</li>
</ul>
<p>I must admit that the Linux kernel improves THP with each release, and in the near future the whole situation may fundamentally change - maybe the future is already here. There’s a reason Google/Facebook/Intel widely offer THP in their solutions. However, our team wanted the result right here and right now; moreover, changing the Linux kernel on production servers takes quite a lot of time.</p>
<hr />
<p>So, we took the red pill.</p>
<h1 id="where-to-begin">Where to begin?</h1>
<p>I believe every team who tried to remap code and data segments to huge pages started with live-fire combat training against <code class="language-plaintext highlighter-rouge">libhugetlbfs</code>. This library is extremely ancient and supports a huge variety of Unix-like OSs - for Linux, that means very old kernels (2.6.16) and toolchains. I suspect it was originally designed for small embedded systems powered by specialized non-x86 processors with an extremely tiny amount of RAM on board. Anyway, if you’re interested in touching living history and gaining more wisdom - welcome to the <a href="https://github.com/libhugetlbfs/libhugetlbfs">project site</a>.</p>
<p>Exploiting <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> is undoubtedly an uneasy job which takes quite a lot of time; however, the resulting speedup was indisputable. The MySQL server produced +10% additional TPS (transactions per second) in OLTP PS (point select) on a 1vCPU virtual instance, Linux EulerOS, x86_64. iTLB misses became several times lower. The AArch64 platform showed an even bigger performance improvement. Our team additionally researched remapping the <code class="language-plaintext highlighter-rouge">text</code>/<code class="language-plaintext highlighter-rouge">text + data</code>/<code class="language-plaintext highlighter-rouge">text + data + bss</code> segments; the results are shown in the following chart (AArch64 CPU: Huawei Kunpeng 920):</p>
<p><img src="/assets/images/elfremapper/chart.png" alt="chart" /></p>
<p>Of course, these were heavily loaded CPU-bound benchmarks (“serious CPU starvation”); nevertheless, +10% is definitely worth further research. Along the way, several issues/restrictions appeared:</p>
<ul>
<li>Turning on ASLR on the server (MySQL is compiled with PIE by default) caused SIGSEGV. The following investigation revealed a clear bug inside <code class="language-plaintext highlighter-rouge">libhugetlbfs</code>, which was immediately reported (with a fixing patch attached): <a href="https://github.com/libhugetlbfs/libhugetlbfs/issues/49">https://github.com/libhugetlbfs/libhugetlbfs/issues/49</a>. After a year I was notified that the testing team can’t reproduce the problem. I’m grieving…</li>
<li>The maximum number of segments that can be remapped is 3. I think the reason lies in history again: the GNU BFD linker used to generate only 2 LOAD ELF segments (“r-x” and “rw-“), so a limit of 3 made sense. However, recent security requirements made the linker a bit smarter - it now generates a separate LOAD segment for constants (“r--”) by default. At the same time, the GNU BFD linker isn’t that smart: it looks like it cuts the read-only segment off both the “r-x” and “rw-“ segments. As a result, 4 LOAD segments are generated, so only the first 3 are remapped, while the last segment - usually the biggest one, containing <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.bss</code> - is left untouched.</li>
<li>If you conquer the previous problem and force the linker to produce 2 LOAD segments, you notice that when the last LOAD segment is remapped, the heap segment simultaneously disappears from the virtual address space. This slips by without any warnings or errors; the <code class="language-plaintext highlighter-rouge">brk</code> system call simply stops serving its users (it always returns <code class="language-plaintext highlighter-rouge">ENOMEM</code>). This affects only <code class="language-plaintext highlighter-rouge">glibc</code>, which uses <code class="language-plaintext highlighter-rouge">brk</code> for small allocations (less than 128K). After the incident, <code class="language-plaintext highlighter-rouge">glibc</code> switches to the <code class="language-plaintext highlighter-rouge">mmap</code> system call for all allocations. I’m not sure how much this hurts performance and the system as a whole; if you know, please share your ideas and knowledge in the comments. P.S. I’ve verified that <code class="language-plaintext highlighter-rouge">jemalloc</code> isn’t affected, since it uses <code class="language-plaintext highlighter-rouge">mmap</code> only.</li>
<li>There’s no easy and robust way to integrate it into an application - all the work is done in the DSO constructor without any logging. If an error happens, the application doesn’t start, and figuring out the reason for the failure takes time.</li>
<li><code class="language-plaintext highlighter-rouge">hugetlbfs</code> is used as the API for huge page allocation. You <em>must</em> mount this file system and provide correct access rights for your application. In cloud instances this additional dependency on <code class="language-plaintext highlighter-rouge">hugetlbfs</code> causes extra trouble with mounting. Meanwhile, since Linux 2.6.32 the <code class="language-plaintext highlighter-rouge">mmap</code> system call has provided an easy and reliable interface for anonymous huge page allocation. This issue stems from backward compatibility with Linux 2.6.16.</li>
<li>The application must be built with the following linker flags: <code class="language-plaintext highlighter-rouge">common-page-size=2M max-page-size=2M</code>. I understand that this is a useful security requirement, so it’s just a little inconvenience. Still, the ability to remap any application to huge pages for test purposes / a quick performance estimate would be a very pleasant bonus for developers.</li>
</ul>
<p>Some of the problems are critical. In other words, <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> is not production ready. Oh…</p>
<p><img src="/assets/images/elfremapper/facepalm.jpg" alt="facepalm" /></p>
<hr />
<p>Rolling up my sleeves higher and taking more air into my lungs, I began a slow and thorough dive into <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> in order to make a server analogue for our MySQL, free of all these disadvantages, with the intention of integrating the solution into the project code later.</p>
<h1 id="how-is-a-program-loaded">How is a program loaded?</h1>
<p>I’d like to put some restrictions on the following research: I deal with Linux 64-bit / ELF format / <code class="language-plaintext highlighter-rouge">glibc</code>.</p>
<p>To sort things out, it’s necessary to start our journey by describing the application launch algorithm in Linux, i.e. what is hidden behind the <code class="language-plaintext highlighter-rouge">execve</code> system call? Yet again, there are a lot of gorgeous articles which highlight all the steps / functions in <code class="language-plaintext highlighter-rouge">glibc</code> / the Linux kernel. For example, <a href="https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html">here</a> the invocation of GNU <code class="language-plaintext highlighter-rouge">ls</code> from <code class="language-plaintext highlighter-rouge">bash</code> is shown in detail.</p>
<p>From all that plethora of technical information, I’ll focus on the following:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">execve</code> for ELF file eventually calls <code class="language-plaintext highlighter-rouge">load_elf_binary</code> in <code class="language-plaintext highlighter-rouge">fs/binfmt_elf.c</code> in Linux kernel</li>
<li><code class="language-plaintext highlighter-rouge">load_elf_binary</code>:
<ul>
<li>parses the ELF file, searches for code and data segments</li>
<li>maps code and data segments to virtual memory, then HEAP segment is initialized right after the data segment</li>
<li>maps VDSO segment</li>
<li>looks for the current ELF interpreter (usually, it’s the dynamic linker from <code class="language-plaintext highlighter-rouge">glibc</code>), then loads it into the memory (again: interpreter’s code and data are loaded to the memory)</li>
</ul>
</li>
<li>Linux kernel executes all other necessary functions, then all the information about the just-created mappings is saved on the stack, then the dynamic <code class="language-plaintext highlighter-rouge">glibc</code> linker is invoked (or the application itself if no interpreter is specified, i.e. the binary is linked statically)</li>
<li>Dynamic linker (<code class="language-plaintext highlighter-rouge">glibc</code>):
<ul>
<li>initializes the list of all mappings which were created by the kernel</li>
<li>reads the DSO list, which the application depends on</li>
<li>looks for the DSOs in the system (<code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>/<code class="language-plaintext highlighter-rouge">RPATH</code>/<code class="language-plaintext highlighter-rouge">RUNPATH</code>), then loads them and adds the meta information to the application’s list of all mappings</li>
<li>executes DSO constructors</li>
<li>transfers the execution to the <code class="language-plaintext highlighter-rouge">main</code> function</li>
</ul>
</li>
</ul>
<p>So, the list of all application mappings is stored in:</p>
<ul>
<li>Linux kernel</li>
<li><code class="language-plaintext highlighter-rouge">glibc</code> library</li>
</ul>
<p>And the description of all these mappings is originally written in ELF file.</p>
<p>Linux kernel publishes the application mappings in <code class="language-plaintext highlighter-rouge">/proc/$pid/smaps</code> (detailed list) and <code class="language-plaintext highlighter-rouge">/proc/$pid/maps</code> (short list). Example for short list in Ubuntu 20.04 (kernel 5.4):</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /proc/self/maps
555555554000-555555556000 r--p 00000000 08:02 24117778 /usr/bin/cat
555555556000-55555555b000 r-xp 00002000 08:02 24117778 /usr/bin/cat
55555555b000-55555555e000 r--p 00007000 08:02 24117778 /usr/bin/cat
55555555e000-55555555f000 r--p 00009000 08:02 24117778 /usr/bin/cat
55555555f000-555555560000 rw-p 0000a000 08:02 24117778 /usr/bin/cat
555555560000-555555581000 rw-p 00000000 00:00 0 <span class="o">[</span>heap]
7ffff7abc000-7ffff7ade000 rw-p 00000000 00:00 0
7ffff7ade000-7ffff7dc4000 r--p 00000000 08:02 24125924 /usr/lib/locale/locale-archive
7ffff7dc4000-7ffff7dc6000 rw-p 00000000 00:00 0
7ffff7dc6000-7ffff7deb000 r--p 00000000 08:02 24123961 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7deb000-7ffff7f63000 r-xp 00025000 08:02 24123961 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7f63000-7ffff7fad000 r--p 0019d000 08:02 24123961 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fad000-7ffff7fae000 <span class="nt">---p</span> 001e7000 08:02 24123961 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fae000-7ffff7fb1000 r--p 001e7000 08:02 24123961 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fb1000-7ffff7fb4000 rw-p 001ea000 08:02 24123961 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fb4000-7ffff7fb8000 rw-p 00000000 00:00 0
7ffff7fc9000-7ffff7fcb000 rw-p 00000000 00:00 0
7ffff7fcb000-7ffff7fce000 r--p 00000000 00:00 0 <span class="o">[</span>vvar]
7ffff7fce000-7ffff7fcf000 r-xp 00000000 00:00 0 <span class="o">[</span>vdso]
7ffff7fcf000-7ffff7fd0000 r--p 00000000 08:02 24123953 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7fd0000-7ffff7ff3000 r-xp 00001000 08:02 24123953 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ff3000-7ffff7ffb000 r--p 00024000 08:02 24123953 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ffc000-7ffff7ffd000 r--p 0002c000 08:02 24123953 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ffd000-7ffff7ffe000 rw-p 0002d000 08:02 24123953 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 <span class="o">[</span>stack]
ffffffffff600000-ffffffffff601000 <span class="nt">--xp</span> 00000000 00:00 0 <span class="o">[</span>vsyscall]
</code></pre></div></div>
<p>The list of LOAD segments in ELF:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>readelf <span class="nt">-Wl</span> /bin/cat
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x0002d8 0x0002d8 R 0x8
INTERP 0x000318 0x0000000000000318 0x0000000000000318 0x00001c 0x00001c R 0x1
<span class="o">[</span>Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x0016e0 0x0016e0 R 0x1000
LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x004431 0x004431 R E 0x1000
LOAD 0x007000 0x0000000000007000 0x0000000000007000 0x0021d0 0x0021d0 R 0x1000
LOAD 0x009a90 0x000000000000aa90 0x000000000000aa90 0x000630 0x0007c8 RW 0x1000
DYNAMIC 0x009c38 0x000000000000ac38 0x000000000000ac38 0x0001f0 0x0001f0 RW 0x8
NOTE 0x000338 0x0000000000000338 0x0000000000000338 0x000020 0x000020 R 0x8
NOTE 0x000358 0x0000000000000358 0x0000000000000358 0x000044 0x000044 R 0x4
GNU_PROPERTY 0x000338 0x0000000000000338 0x0000000000000338 0x000020 0x000020 R 0x8
GNU_EH_FRAME 0x00822c 0x000000000000822c 0x000000000000822c 0x0002bc 0x0002bc R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0x10
GNU_RELRO 0x009a90 0x000000000000aa90 0x000000000000aa90 0x000570 0x000570 R 0x1
</code></pre></div></div>
<p>DSO dependencies:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ldd /bin/cat
linux-vdso.so.1 <span class="o">(</span>0x00007ffff7fce000<span class="o">)</span>
libc.so.6 <span class="o">=></span> /lib/x86_64-linux-gnu/libc.so.6 <span class="o">(</span>0x00007ffff7dba000<span class="o">)</span>
/lib64/ld-linux-x86-64.so.2 <span class="o">(</span>0x00007ffff7fcf000<span class="o">)</span>
</code></pre></div></div>
<p>Analysis of <code class="language-plaintext highlighter-rouge">/proc/$pid/maps</code>:</p>
<ul>
<li>libc.so.6 - it’s libc-2.31.so</li>
<li>ld-linux-x86-64.so.2 - it’s ld-2.31.so</li>
<li>linux-vdso.so.1 - it’s [vdso], a virtual DSO provided by the kernel to speed up 4 system calls (on x86_64), more information is <a href="https://man7.org/linux/man-pages/man7/vdso.7.html">here</a></li>
<li>[vvar] and [vsyscall] - the obsolete implementation of [vdso] (the kernel keeps backward compatibility)</li>
<li>[heap] and [stack] - everything is clear</li>
<li><code class="language-plaintext highlighter-rouge">/usr/bin/cat</code> - the <code class="language-plaintext highlighter-rouge">LOAD</code> segments from <code class="language-plaintext highlighter-rouge">readelf</code>, shifted by <code class="language-plaintext highlighter-rouge">0x555555554000</code> by the kernel.</li>
</ul>
<p>At this point you’ll probably point out that, hey, there are 4 LOAD segments, but Linux shows 5 mappings. It’s all about the <code class="language-plaintext highlighter-rouge">GNU_RELRO</code> technology (and security again!): the <code class="language-plaintext highlighter-rouge">GNU_RELRO</code> segment places the GOT/PLT tables on a separate page (4K by default in our example). They are filled in by the dynamic linker; when the job is done, it removes write access from this page (or pages, if the tables are bigger). Now if someone tries to hack the application by replacing the address of some popular external function (e.g. <code class="language-plaintext highlighter-rouge">printf@plt</code>), the application will be sent a <code class="language-plaintext highlighter-rouge">SIGSEGV</code> signal instead. Checking the <code class="language-plaintext highlighter-rouge">GNU_RELRO</code> addresses:</p>
<ul>
<li>0x55555555e000 - 0x55555555f000: 4K (mapping start/end, one 4K page)</li>
<li>0x555555554000 + 0xaa90 = 0x55555555ea90 (kernel’s shift + <code class="language-plaintext highlighter-rouge">GNU_RELRO</code> start address)</li>
<li>0x55555555ea90 & (~(0x1000 - 1)) = 0x55555555e000 (align previous result on 4K boundary => get the mapping start address)</li>
<li>0x55555555f000 - 0x570 = 0x55555555ea90 (from the end of <code class="language-plaintext highlighter-rouge">GNU_RELRO</code> segment subtract the size of this segment => get unaligned mapping start address)</li>
<li>Numbers add up, and that’s good!</li>
</ul>
<p>Brief description for <code class="language-plaintext highlighter-rouge">readelf</code> output:</p>
<ul>
<li>Offset = offset in ELF file</li>
<li>VirtAddr = virtual address in application address space</li>
<li>PhysAddr = physical address (I’ve never used this field; it’s interesting what it’s needed for)</li>
<li>FileSiz = the size of the segment’s data in the ELF file</li>
<li>MemSiz = FileSiz + the size of the <code class="language-plaintext highlighter-rouge">.bss</code> section (which is zero-initialized at startup)</li>
</ul>
<p>To get the application mappings from <code class="language-plaintext highlighter-rouge">glibc</code>, use the <code class="language-plaintext highlighter-rouge">dl_iterate_phdr</code> function: <a href="https://man7.org/linux/man-pages/man3/dl_iterate_phdr.3.html">manual</a>. In fact, this API returns the true 4 LOAD segments, exactly as in the <code class="language-plaintext highlighter-rouge">readelf</code> output.</p>
<hr />
<p>In total, armed with all information described above, I proceed to my main goal - remap LOAD segments.</p>
<h1 id="attempt-1">Attempt 1</h1>
<p>I use classical huge pages (<strong>not</strong> THP), size = 2M, my CPU is either AArch64 or x86_64.</p>
<p>I decided to name my newly born library <code class="language-plaintext highlighter-rouge">elfremapper</code> and mostly copy the main technology from the <code class="language-plaintext highlighter-rouge">remap_segments</code> function of <code class="language-plaintext highlighter-rouge">libhugetlbfs</code>. To make life easier, I’ll make my library static. Investigate the <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> sources, do the same:</p>
<ol>
<li>Enumerate all the LOAD segments via <code class="language-plaintext highlighter-rouge">dl_iterate_phdr</code> (thank you, <code class="language-plaintext highlighter-rouge">glibc</code>, for the accuracy of the presented data: no magic with <code class="language-plaintext highlighter-rouge">GNU_RELRO</code>).</li>
<li>Check the segments don’t overlap (2M boundary aligned).</li>
<li>Additionally align segments if ASLR is turned on (in this case the segments have the fixed shift of 0x555555554000 and an additional random shift that is uniquely generated by the kernel every time the application is launched - every address produced by kernel is, of course, 4K aligned).</li>
<li>Allocate huge pages using <code class="language-plaintext highlighter-rouge">hugetlbfs</code> file system (for each LOAD segment create a separate <code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code> mapping).</li>
<li>Copy each old mapping (4K based) to the new mapping (2M based): <code class="language-plaintext highlighter-rouge">mmap</code> -> <code class="language-plaintext highlighter-rouge">memcpy</code> -> <code class="language-plaintext highlighter-rouge">munmap</code>.</li>
<li>Don’t close file descriptors, we need files to stay in memory.</li>
<li>
<p>Check the data is copied: <code class="language-plaintext highlighter-rouge">mmap</code> the first file descriptor (left open on previous step), read the data - and - there’s no data in this mapping!</p>
<p>Well, the next step is reading the manual for <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> from official Linux kernel documentation and reading <code class="language-plaintext highlighter-rouge">man mmap</code>. The final conclusion is that data inside <code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code> mapping is lost after <code class="language-plaintext highlighter-rouge">munmap</code>, because nothing is actually written to underlying file. It makes no difference what state of file descriptor is (opened or closed). <code class="language-plaintext highlighter-rouge">man mmap</code>:</p>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">MAP_PRIVATE Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.</code></p>
</blockquote>
<p>Look more carefully at the <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> sources. Yes, it uses <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> before <code class="language-plaintext highlighter-rouge">memcpy</code>. That seems quite unsafe, but there’s no other choice than making a <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> mapping too; meanwhile, the opened files are immediately removed (via <code class="language-plaintext highlighter-rouge">unlink</code>) from <code class="language-plaintext highlighter-rouge">hugetlbfs</code> before anything is written to them. Continue:</p>
</li>
<li>Check: data is copied, the file descriptors are left unclosed (remember: all files are unlinked).</li>
<li>
<p>Unmap all our current code and data mappings - and - get <code class="language-plaintext highlighter-rouge">SIGSEGV</code> on the next code line following <code class="language-plaintext highlighter-rouge">munmap</code>.</p>
<p>What’s wrong??? <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> does <code class="language-plaintext highlighter-rouge">munmap</code> and doesn’t crash, while my solution breaks apart. Thinking…</p>
<p>When code is executed, the CPU reads it from its mapping just like any other data; the only difference is the execute permission (the mapping flag). It turns out that as soon as the code segment is removed from our virtual address space, fetching the next assembly instruction is done from an address which does not belong to our process, and quite reasonably <code class="language-plaintext highlighter-rouge">SIGSEGV</code> is sent. Why doesn’t <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> crash? The point is that <code class="language-plaintext highlighter-rouge">libhugetlbfs</code> is supplied as a DSO and, of course, has its own separate mapping, which remains intact (I remap the main application’s mappings only).</p>
<p>How to fix? Read <code class="language-plaintext highlighter-rouge">man mmap</code>:</p>
<blockquote>
<p><code class="language-plaintext highlighter-rouge">MAP_FIXED Don't interpret addr as a hint: place the mapping at exactly that address. addr must be suitably aligned: for most architectures a multiple of the page size is sufficient; however, some architectures may impose additional restrictions. If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail.</code></p>
</blockquote>
<p>Well-well, so if I use <code class="language-plaintext highlighter-rouge">MAP_FIXED</code> and <code class="language-plaintext highlighter-rouge">mmap</code> over the existing mapping, the kernel removes it silently. That’s interesting. What if we enter the <code class="language-plaintext highlighter-rouge">mmap</code> system call from the old mapping and exit with the new mapping at the old virtual addresses? Should work, checking:</p>
</li>
<li><strong>Do not</strong> unmap the current mappings of code and data, utilize opened file descriptors (they point to <code class="language-plaintext highlighter-rouge">hugetlbfs</code> with prepared memory) and make <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> + <code class="language-plaintext highlighter-rouge">MAP_FIXED</code> mapping over existing ones (i.e. <code class="language-plaintext highlighter-rouge">mmap</code> consumes both: base virtual address and file descriptor). It works!</li>
<li>Check <code class="language-plaintext highlighter-rouge">/proc/$pid/maps</code> - instead of the application name, our LOAD segments are represented with something like <code class="language-plaintext highlighter-rouge">/dev/hugepages/g4PcpN (deleted)</code>. That’s expected if <code class="language-plaintext highlighter-rouge">hugetlbfs</code> is mounted on <code class="language-plaintext highlighter-rouge">/dev/hugepages</code> and the temporary files are created by <code class="language-plaintext highlighter-rouge">mktemp</code>. Mission accomplished.</li>
</ol>
<p>Small help: mounting <code class="language-plaintext highlighter-rouge">hugetlbfs</code> and huge pages allocation:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> /dev/hugepages
<span class="nv">$ </span>mount <span class="nt">-t</span> hugetlbfs <span class="nt">-o</span> <span class="nv">pagesize</span><span class="o">=</span>2M none /dev/hugepages
<span class="nv">$ </span><span class="nb">sudo </span>bash <span class="nt">-c</span> <span class="s2">"echo 100 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"</span>
</code></pre></div></div>
<p>Summary:</p>
<ul>
<li>Remapping code is linked statically, i.e. it remaps itself at some execution point.</li>
<li>I used <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> for code and data segments. “There will be consequences” - you might tell me with a smile, and you’re absolutely right!</li>
<li>How many huge pages are consumed? It appears that the number of pages calculated from the ELF file and the number of consumed pages (<code class="language-plaintext highlighter-rouge">nr_hugepages</code> - <code class="language-plaintext highlighter-rouge">free_hugepages</code>) are equal. That’s very important, because if there aren’t enough huge pages, an error should be printed and, in general, the application must switch back to the default system pages (usually 4K), i.e. put everything back.</li>
</ul>
<p>Rewrite algorithm with “out of memory” handling:</p>
<ol>
<li>open file descriptor on <code class="language-plaintext highlighter-rouge">hugetlbfs</code>, unlink the underlying file;</li>
<li>allocate huge memory via <code class="language-plaintext highlighter-rouge">mmap</code> using file descriptor;</li>
<li>check whether <code class="language-plaintext highlighter-rouge">mmap</code> succeeds, if not, print error, put everything already remapped back and stop;</li>
<li>copy our current segment (code or data) in recently allocated huge memory;</li>
<li>unmap the huge segment, leave the file descriptor opened;</li>
<li>make the final <code class="language-plaintext highlighter-rouge">mmap</code> (fixed|shared), intentionally overlap with current segment;</li>
<li>note: the final <code class="language-plaintext highlighter-rouge">mmap</code> call never fails with out of memory error, because the mapping is <em>shared</em> (no need to reserve additional memory in kernel), and all huge memory is already allocated on step 2 and checked on step 3.</li>
</ol>
<p>Why <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> for code/data is dangerous?</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">fork</code> stops working. Wrong wording: it still works, but it <strong>does not</strong> copy shared mappings between child and parent (which makes sense). This causes race conditions between child and parent accessing the same data sections => sporadic unpredictable crashes => undefined behaviour.</li>
<li>New versions of <code class="language-plaintext highlighter-rouge">gdb</code> stop working: <code class="language-plaintext highlighter-rouge">gdb attach</code> and loading of <code class="language-plaintext highlighter-rouge">core</code> files. Meanwhile, old gdb versions still work; I don’t know why, there was no time to dig deeper.</li>
</ol>
<p>Another global problem arises here, which I’d like to describe separately: remapping breaks symbol resolution in <code class="language-plaintext highlighter-rouge">perf</code>. As a result, <code class="language-plaintext highlighter-rouge">perf top</code>/<code class="language-plaintext highlighter-rouge">perf record</code> show you a wide range of raw, unresolved addresses instead of function names. For better or worse, <code class="language-plaintext highlighter-rouge">perf</code> uses ELF files for symbol loading, and the exact ELF files are read from the same <code class="language-plaintext highlighter-rouge">/proc/$pid/maps</code>, which changed in our case. Fortunately, the trouble can be fixed easily using existing <code class="language-plaintext highlighter-rouge">perf</code> features. Back in the day, when JIT compilers appeared (as in popular Java or Python), <code class="language-plaintext highlighter-rouge">perf</code> was extended with a JIT API: the symbols are loaded from the <code class="language-plaintext highlighter-rouge">/tmp/perf-$pid.map</code> file, which has a plain text format (3 columns: start address, size and symbol name). So, what should be done here is:</p>
<ul>
<li>compile a binary with debug symbols</li>
<li>generate a file with symbols via <code class="language-plaintext highlighter-rouge">nm</code>:
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>nm <span class="nt">--numeric-sort</span> <span class="nt">--print-size</span> <span class="nt">--demangle</span> <span class="nv">$app</span> | <span class="nb">awk</span> <span class="s1">'$4{print $1" "$2" "$4}'</span> | <span class="nb">grep</span> <span class="nt">-Ee</span><span class="s2">"^0"</span> <span class="o">></span> /tmp/perf-<span class="nv">$pid</span>.map
</code></pre></div> </div>
</li>
</ul>
<h1 id="attempt-2">Attempt 2</h1>
<p><code class="language-plaintext highlighter-rouge">MAP_SHARED</code> haunts me. How to make the solution better? Take a detailed look into <code class="language-plaintext highlighter-rouge">libhugetlbfs</code>: the final <code class="language-plaintext highlighter-rouge">mmap</code> is executed with <code class="language-plaintext highlighter-rouge">MAP_PRIVATE|MAP_FIXED</code> (step 6 of our algorithm). Well, change <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> to <code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code>, check <code class="language-plaintext highlighter-rouge">fork</code>/<code class="language-plaintext highlighter-rouge">gdb</code> (it works!), run high load benchmarks. After ~3 weeks of different tests, the product crashes with <code class="language-plaintext highlighter-rouge">SIGSEGV</code> and the <code class="language-plaintext highlighter-rouge">core</code> dump is corrupted.</p>
<p>Detailed analysis:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">MAP_PRIVATE</code> leads to doubled huge page consumption. At the moment of the final <code class="language-plaintext highlighter-rouge">mmap</code> (remember, step 6) the kernel copies all shared pages to private pages (copy-on-write + reservation). At the very least, the memory is consumed very intensively during the algorithm execution. In my case all the pages (x2 compared to the previous version) are not returned to the OS until the application stops, even if all file descriptors on <code class="language-plaintext highlighter-rouge">hugetlbfs</code> are closed. After the remapping is done, there’s no need for the shared pages anymore. Strange. Didn’t invest more time in it.</li>
<li>So, the final <code class="language-plaintext highlighter-rouge">mmap</code> leads to huge memory allocation, which means that an “out of memory” error might occur. When there aren’t enough huge pages, <code class="language-plaintext highlighter-rouge">mmap</code> returns <code class="language-plaintext highlighter-rouge">ENOMEM</code> and the execution of the next assembly instruction produces <code class="language-plaintext highlighter-rouge">SIGSEGV</code>. Reminds me of something I saw before… Further investigation reveals that <code class="language-plaintext highlighter-rouge">mmap</code> with overlapping memory regions has an obnoxious <em>side effect</em> in case of errors. What happens:
<ul>
<li>kernel detects and discards the overlapping memory regions;</li>
<li>then it tries to allocate huge pages, fails and returns <code class="language-plaintext highlighter-rouge">ENOMEM</code> error;</li>
<li>kernel <strong>does not</strong> return old memory region back, that’s why after <code class="language-plaintext highlighter-rouge">mmap</code> system call the code section is lost!</li>
</ul>
</li>
</ul>
<p>Thinking…</p>
<p><code class="language-plaintext highlighter-rouge">libhugetlbfs</code> doesn’t handle this error situation at all. If worse comes to worst, the application is killed via <code class="language-plaintext highlighter-rouge">SIGABRT</code> by the library itself. On the other hand, the Google/Facebook/Intel products based on THP actively work with <code class="language-plaintext highlighter-rouge">mremap</code>. What if it can be used for huge pages too? The approach is very simple: create a private mapping backed with huge pages, copy the segment content to it, then just move it to the new virtual address range (with overlap if needed).</p>
<p>Interesting. I try it and get the <code class="language-plaintext highlighter-rouge">MAP_FAILED</code> error (<code class="language-plaintext highlighter-rouge">EINVAL</code>). Why?</p>
<p>If you look into Linux kernel source code, you’ll find that <code class="language-plaintext highlighter-rouge">mremap</code> system call still doesn’t support moving memory blocks backed with huge pages (<a href="https://github.com/torvalds/linux/blob/master/mm/mremap.c">https://github.com/torvalds/linux/blob/master/mm/mremap.c</a>):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">is_vm_hugetlb_page</span><span class="p">(</span><span class="n">vma</span><span class="p">))</span>
<span class="k">return</span> <span class="nf">ERR_PTR</span><span class="p">(</span><span class="o">-</span><span class="n">EINVAL</span><span class="p">);</span>
</code></pre></div></div>
<p>The bug fix is a rollback to the <code class="language-plaintext highlighter-rouge">MAP_SHARED</code> mapping for code/data segments. Sadness eats me alive…</p>
<h1 id="attempt-3">Attempt 3</h1>
<p>Still, how to make the solution better? Looks like the first thought (“a static library makes our lives easier”) was fundamentally wrong in this particular case. Well, let’s create our own DSO!</p>
<p>Now it’s allowed to delete the application’s code mapping, and <code class="language-plaintext highlighter-rouge">SIGSEGV</code> doesn’t chase you, because the DSO code segment stays intact. Moreover, it’s possible to drop the <code class="language-plaintext highlighter-rouge">hugetlbfs</code> dependency, because we have a relatively new kernel. As you probably already noticed, in the previous attempts, when the remapping code is linked statically, the <code class="language-plaintext highlighter-rouge">hugetlbfs</code> usage is mandatory: the mapping for a concrete virtual address range must be replaced in a single system call. That’s why the <code class="language-plaintext highlighter-rouge">mmap</code> API is fully utilized - both the virtual addresses and a file descriptor backed by <code class="language-plaintext highlighter-rouge">hugetlbfs</code> are specified. The new approach with a DSO relaxes this limitation: several system calls can be executed, reliable error handling becomes possible, and in bad cases putting the old mappings back looks like an easy job. Of course, on the other hand there’s always the itching idea to add a custom Linux kernel system call which does all the magic for me, but for production purposes it’s not an option.</p>
<p>Change the algorithm to the following:</p>
<ol>
<li>Make an anonymous 4K mapping with one aim only - force the kernel to find empty space of the appropriate size.</li>
<li>Move (via <code class="language-plaintext highlighter-rouge">mremap</code>) current working code and data mappings to the space allocated on the step 1 => I get overlapping address ranges and previously allocated memory block disappears without a single page fault; in addition, no <code class="language-plaintext highlighter-rouge">SIGSEGV</code> here, because CPU is executing DSO code segment right now and nobody touches it.</li>
<li>Allocate anonymous huge memory on the old virtual address range (now this memory is vacant), call <code class="language-plaintext highlighter-rouge">mmap</code> (private + fixed + huge2m).</li>
<li>If “out of memory” occurs, discard all huge memory, move old working code and data mappings back to the old virtual addresses and stop the algorithm, otherwise continue.</li>
<li>Copy all content of old mappings to huge pages which have been just allocated.</li>
<li>Remove old 4K mappings, return memory to the OS</li>
</ol>
<p>As you can see, the DSO makes a difference. However, the pitfalls exist everywhere, so what to expect?</p>
<ul>
<li>GOT/PLT tables <strong>must</strong> be filled in advance, otherwise <code class="language-plaintext highlighter-rouge">SIGSEGV</code> returns. The fact is that the <code class="language-plaintext highlighter-rouge">glibc</code> dynamic linker works in lazy mode by default: it resolves external function names only when they are actually used by the application or a DSO. These tables are created inside the LOAD segments of their “consumers” (remember the story about <code class="language-plaintext highlighter-rouge">GNU_RELRO</code>?). Our own DSO uses some <code class="language-plaintext highlighter-rouge">libc</code> functions (<code class="language-plaintext highlighter-rouge">mmap</code>/<code class="language-plaintext highlighter-rouge">mremap</code>/<code class="language-plaintext highlighter-rouge">memcpy</code>), so our own PLT/GOT tables have to be filled too. By default, if a function isn’t bound yet (its table entry is empty), the dynamic linker is invoked (actually, the empty entry contains a jump instruction which eventually calls the linker). If the dynamic linker is called in the middle of the remapping process, the <code class="language-plaintext highlighter-rouge">glibc</code> code crashes somewhere inside. That’s weird, because the <code class="language-plaintext highlighter-rouge">heap</code> in my particular experiment was isolated (the linker uses it for storing the DSO list), the LOAD segment of our DSO is fixed and intact, and only the main application segments are moved… I didn’t invest more time in figuring out why this happens; if you know the details, please share :) I was able to fix the issue quickly by adding the <code class="language-plaintext highlighter-rouge">-Wl,-znow</code> linker flag, which tells the dynamic linker to do all the binding work before any user code is executed.</li>
<li><code class="language-plaintext highlighter-rouge">fork</code> starts working as expected, because the code/data segments are private now. However, <code class="language-plaintext highlighter-rouge">fork</code> consumes memory, and if the huge pages run out during the system call, the application gets <code class="language-plaintext highlighter-rouge">SIGBUS</code>. Well, it’s much better than undefined behaviour and memory corruption, but still not ideal. Ideally, in such cases the child would, for example, switch back to default pages and continue working as if nothing had happened. I must confess, I didn’t add a <code class="language-plaintext highlighter-rouge">SIGBUS</code> handler or make other attempts to fix this case; by the moment it was discovered I was completely exhausted and just moved the remapping function to a place where the <code class="language-plaintext highlighter-rouge">fork</code> has already undoubtedly been executed. <code class="language-plaintext highlighter-rouge">THP</code> comes to mind: that technology should handle such cases automatically somewhere inside the Linux kernel. Again, if you know how it works inside, please shed some light on the issue.</li>
</ul>
<h1 id="numa">NUMA</h1>
<p>As is well known, each NUMA node allocates huge pages separately. Still don’t believe?</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> /sys/devices/system/node/node<span class="k">*</span>/hugepages/hugepages-2048kB
</code></pre></div></div>
<p>Our Linux kernel deployed on servers with NUMA has quite tricky and harsh behaviour. We use NUMA servers for the cloud, and each virtual machine is usually confined to one NUMA node (<code class="language-plaintext highlighter-rouge">/sys/fs/cgroup/cpuset/$vm/cpuset.mems</code>). When the <code class="language-plaintext highlighter-rouge">mmap</code> system call executes, the kernel scans the available huge pages on <em>all</em> NUMA nodes and, if there is enough memory in total, the call succeeds. Then, during the following <code class="language-plaintext highlighter-rouge">page fault</code>, the kernel applies the <code class="language-plaintext highlighter-rouge">cgroup</code> rules, tries to find huge pages on the local NUMA node, fails, and sends <code class="language-plaintext highlighter-rouge">SIGBUS</code> to the application. As a result, the fancy error handling sometimes doesn’t work.</p>
<p>As a mitigation, the following scheme was invented:</p>
<ul>
<li>Roughly estimate the number of VMs that could be placed on each NUMA node, then calculate and allocate a proper amount of huge pages statically for that particular NUMA node (via <code class="language-plaintext highlighter-rouge">nr_hugepages</code>)</li>
<li>Additional consumption is covered by <code class="language-plaintext highlighter-rouge">overcommit hugepages</code> with a good reservation (let’s say 10 GB):
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo </span>5120 <span class="o">></span> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages
</code></pre></div> </div>
</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">overcommit hugepages</code> allocates dynamically, so there’s a small nonzero probability that the kernel can’t allocate the pages instantly, that the memory is fragmented, etc. Even so, it still makes sense to use this approach for long-living processes like database servers: they are usually restarted about once a month (e.g. during a rolling update), and our remapping algorithm is executed only during process start.</p>
<h1 id="saving-the-heap">Saving the HEAP</h1>
<p>Remember, when I iterated through the weaknesses of <code class="language-plaintext highlighter-rouge">libhugetlbfs</code>, I told you that the library might wipe out the HEAP segment from the application address space. Now I’ll tell you more about this process.</p>
<p>When we instruct the linker to align LOAD segments in the ELF file on a 2M boundary (<code class="language-plaintext highlighter-rouge">common-page-size=2M max-page-size=2M</code>), this doesn’t touch the HEAP segment. Everything is correct: the kernel creates the HEAP when the application starts, while the linker works at compile time. That means <code class="language-plaintext highlighter-rouge">[heap]</code> has the default 4K alignment and is “glued” to the last LOAD segment. When the last LOAD segment is remapped to huge pages, its end is aligned on a 2M boundary, which of course overlaps <code class="language-plaintext highlighter-rouge">[heap]</code>. The last LOAD segment data is then copied, but nobody copies the <code class="language-plaintext highlighter-rouge">[heap]</code> data. Furthermore, the remapping is done during process start, when the <code class="language-plaintext highlighter-rouge">[heap]</code> is still quite small, so it often lies entirely inside the “tail” of the last LOAD segment. The result is tragic:</p>
<ul>
<li>all data stored on the heap is lost;</li>
<li>the HEAP segment itself is lost - from then on, the <code class="language-plaintext highlighter-rouge">brk</code> system call always returns <code class="language-plaintext highlighter-rouge">ENOMEM</code>.</li>
</ul>
<p>Why the Linux kernel weeds out <code class="language-plaintext highlighter-rouge">[heap]</code> completely from the application address space when a huge page entirely overlaps it remains an open question. If you know, please tell me :)</p>
<p>We solved this problem quite simply:</p>
<ul>
<li>Read the current <code class="language-plaintext highlighter-rouge">[heap]</code> begin/end addresses from <code class="language-plaintext highlighter-rouge">/proc/$pid/maps</code>; if the last LOAD segment (2M aligned) overlaps it, all the HEAP data is copied to the huge pages too. After the remapping, all the virtual addresses stay the same and the data isn’t corrupted.</li>
<li>If <code class="language-plaintext highlighter-rouge">[heap]</code> lies entirely within the last LOAD segment (2M aligned), it is artificially extended first (a manual call to <code class="language-plaintext highlighter-rouge">brk</code>, size = 2M), so that some part of the HEAP segment survives the overlap. It has been experimentally proven that in this case <code class="language-plaintext highlighter-rouge">brk</code> continues to work correctly, and the <code class="language-plaintext highlighter-rouge">glibc</code> memory allocator works correctly too. What happens if the <code class="language-plaintext highlighter-rouge">glibc</code> allocator attempts to free memory which was remapped to huge pages is unknown; I suspect that <code class="language-plaintext highlighter-rouge">brk</code> returns an error and <code class="language-plaintext highlighter-rouge">glibc</code> handles it correctly, because I have never seen crashes with such symptoms.</li>
</ul>
<p>If you use a different allocator which relies on the <code class="language-plaintext highlighter-rouge">mmap</code> system call only (anonymous pages, e.g. <code class="language-plaintext highlighter-rouge">jemalloc</code>), you won’t face this problem at all.</p>
<p>Also, if ASLR is turned on, the kernel randomly shifts the starting address of the <code class="language-plaintext highlighter-rouge">[heap]</code>, which then usually lands quite far from the application LOAD segments (>2M away), so this is the rare case where ASLR solves problems instead of adding them.</p>
<h1 id="perf">perf</h1>
<p>Many words have been written about what was done and how to overcome the pitfalls to finally build a robustly working application. It was also mentioned that the technology increases performance (for the MySQL server - TPS in OLTP tests). Nevertheless, it’s much better to observe the positive effects on the CPU via the <code class="language-plaintext highlighter-rouge">perf</code> tool. The thing is, each application has its own set of bottlenecks, and applying our experience to your product may give you zero speedup; meanwhile, <code class="language-plaintext highlighter-rouge">perf</code> always shows how the whole picture changes from the CPU perspective.</p>
<p>Analysis here is based on <a href="https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/">this article</a>, in particular, I’m going to use the following table from official Intel documentation:</p>
<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Description</th>
<th>Event Num.</th>
<th>Umask Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK</td>
<td>Misses in all TLB levels that cause a page walk of any page size.</td>
<td>08H</td>
<td>01H</td>
</tr>
<tr>
<td>DTLB_STORE_MISSES.MISS_CAUSES_A_WALK</td>
<td>Miss in all TLB levels causes a page walk of any page size.</td>
<td>49H</td>
<td>01H</td>
</tr>
<tr>
<td>DTLB_LOAD_MISSES.WALK_DURATION</td>
<td>This event counts cycles when the page miss handler (PMH) is servicing page walks caused by DTLB load misses.</td>
<td>08H</td>
<td>10H</td>
</tr>
<tr>
<td>ITLB_MISSES.MISS_CAUSES_A_WALK</td>
<td>Misses in ITLB that causes a page walk of any page size.</td>
<td>85H</td>
<td>01H</td>
</tr>
<tr>
<td>ITLB_MISSES.WALK_DURATION</td>
<td>This event counts cycles when the page miss handler (PMH) is servicing page walks caused by ITLB misses.</td>
<td>85H</td>
<td>10H</td>
</tr>
<tr>
<td>PAGE_WALKER_LOADS.DTLB_MEMORY</td>
<td>Number of DTLB page walker loads from memory.</td>
<td>BCH</td>
<td>18H</td>
</tr>
<tr>
<td>PAGE_WALKER_LOADS.ITLB_MEMORY</td>
<td>Number of ITLB page walker loads from memory.</td>
<td>BCH</td>
<td>28H</td>
</tr>
</tbody>
</table>
<p>Make the <code class="language-plaintext highlighter-rouge">perf stat</code> request for CPU metrics (let’s say, time duration is 30 seconds):</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>perf <span class="nb">stat</span> <span class="nt">-e</span> cycles <span class="se">\</span>
<span class="nt">-e</span> cpu/event<span class="o">=</span>0x08,umask<span class="o">=</span>0x10,name<span class="o">=</span>dwalkcycles/ <span class="se">\</span>
<span class="nt">-e</span> cpu/event<span class="o">=</span>0x85,umask<span class="o">=</span>0x10,name<span class="o">=</span>iwalkcycles/ <span class="se">\</span>
<span class="nt">-e</span> cpu/event<span class="o">=</span>0x08,umask<span class="o">=</span>0x01,name<span class="o">=</span>dwalkmiss/ <span class="se">\</span>
<span class="nt">-e</span> cpu/event<span class="o">=</span>0x85,umask<span class="o">=</span>0x01,name<span class="o">=</span>iwalkmiss/ <span class="se">\</span>
<span class="nt">-e</span> cpu/event<span class="o">=</span>0xbc,umask<span class="o">=</span>0x18,name<span class="o">=</span>dmemloads/ <span class="se">\</span>
<span class="nt">-e</span> cpu/event<span class="o">=</span>0xbc,umask<span class="o">=</span>0x28,name<span class="o">=</span>imemloads/ <span class="se">\</span>
<span class="nt">-p</span> <span class="nv">$app_pid</span> <span class="nb">sleep </span>30
</code></pre></div></div>
<p>For OLTP workload generation, <code class="language-plaintext highlighter-rouge">sysbench</code> is used; the sources are <a href="https://github.com/akopytov/sysbench">here</a>. Then compile MySQL 8.0 (in our case, 8.0.21).</p>
<p>Run server on NUMA0:</p>
<ul>
<li>Put database in /dev/shm (InnoDB / UTF8);</li>
<li>Create 10 tables, 1M rows each (2.4 GB)</li>
<li>CPU: Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz, no boost/turbo</li>
<li>No ASLR</li>
</ul>
<p>MySQL configuration details:</p>
<ul>
<li>innodb_buffer_pool = 88G</li>
<li>innodb_buffer_pool_instances = 64</li>
<li>innodb_data_file_path=ibdata1:128M:autoextend</li>
<li>threadpool_size = 64</li>
<li>performance_schema=ON</li>
<li>performance_schema_instrument=’wait/synch/%=ON’</li>
<li>innodb_adaptive_hash_index=0</li>
<li>log-bin=mysql-bin</li>
</ul>
<p>Then run <code class="language-plaintext highlighter-rouge">sysbench</code> (OLTP PS / 128 threads) on NUMA1:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sysbench \
--threads=128 \
--report-interval=1 \
--thread-init-timeout=180 \
--db-driver=mysql \
--mysql-socket=/tmp/mysql.sock \
--mysql-db=sbtest \
--mysql-user=root \
--tables=10 \
--table-size=1000000 \
--rand-type=uniform \
--time=3600 \
--histogram \
--db-ps-mode=disable \
oltp_point_select run
</code></pre></div></div>
<p>Workload is CPU-bound / read-only.</p>
<p><code class="language-plaintext highlighter-rouge">perf stat</code> original server (TPS=581K):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 3,213,429,932,057 cycles (57.15%)
194,753,410,016 dwalkcycles (57.14%)
139,241,762,335 iwalkcycles (57.14%)
3,977,146,385 dwalkmiss (57.14%)
4,969,951,701 iwalkmiss (57.14%)
15,102,884 dmemloads (57.14%)
30,794 imemloads (57.14%)
30.005683086 seconds time elapsed
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">perf stat</code> after remapping code/data to huge pages (TPS=641K):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 3,213,038,157,768 cycles (57.15%)
78,822,186,791 dwalkcycles (57.15%)
18,042,959,892 iwalkcycles (57.15%)
1,306,771,287 dwalkmiss (57.15%)
695,958,356 iwalkmiss (57.14%)
18,090,550 dmemloads (57.15%)
4,574 imemloads (57.15%)
30.005697688 seconds time elapsed
</code></pre></div></div>
<p>Compare:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">iwalkcycles</code> drops 7.7 times, <code class="language-plaintext highlighter-rouge">dwalkcycles</code> 2.4 times</li>
<li><code class="language-plaintext highlighter-rouge">iwalkmiss</code> - 7.1 times, <code class="language-plaintext highlighter-rouge">dwalkmiss</code> - 3 times</li>
<li>TPS: +10.3%</li>
</ul>
<p>It should be acknowledged that applying compiler-specific technologies, which significantly improve performance on their own, decreases the positive effect of huge pages; however, the effect still exists. The reason is simple: all compilers seek to concentrate hot code in one place, which enhances cache usage in all CPU components, including the TLB.</p>
<p>Apply PGO/LTO/BOLT to the same MySQL 8.0.21 code (training workload is OLTP RW), run the same test.</p>
<p><code class="language-plaintext highlighter-rouge">perf stat</code> without huge pages (TPS=915K):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 3,212,892,465,135 cycles (57.14%)
175,161,815,648 dwalkcycles (57.15%)
64,908,489,131 iwalkcycles (57.15%)
3,579,819,559 dwalkmiss (57.15%)
2,108,905,920 iwalkmiss (57.15%)
21,031,821 dmemloads (57.15%)
85,002 imemloads (57.14%)
30.004624838 seconds time elapsed
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">perf stat</code> with huge pages for code/data (TPS=952K):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 3,213,313,736,349 cycles (57.15%)
92,547,731,364 dwalkcycles (57.15%)
22,334,822,336 iwalkcycles (57.15%)
1,611,692,765 dwalkmiss (57.15%)
804,414,164 iwalkmiss (57.14%)
25,627,581 dmemloads (57.12%)
15,717 imemloads (57.12%)
30.006456928 seconds time elapsed
</code></pre></div></div>
<p>Compare:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">iwalkcycles</code> drops 2.9 times, <code class="language-plaintext highlighter-rouge">dwalkcycles</code> 1.9 times</li>
<li><code class="language-plaintext highlighter-rouge">iwalkmiss</code> - 2.6 times, <code class="language-plaintext highlighter-rouge">dwalkmiss</code> - 2.2 times</li>
<li>TPS: +4%</li>
</ul>
<p>Summary: our aircraft has successfully taken off, the flight is normal:</p>
<p><img src="/assets/images/elfremapper/takeoff.jpg" alt="takeoff" /></p>
<h1 id="whats-next">What’s next?</h1>
<p>Well, the technology of remapping code and data sections to huge pages has a right to life given the current state of the Linux kernel API and the <code class="language-plaintext highlighter-rouge">glibc</code> library. Still, after thinking over everything written here, one simple idea comes to mind. Why do I need to remap anything? Why not create the huge mapping in the first place?</p>
<p>Daniel Black from MariaDB offered a simple and elegant solution - do all the work right inside the <code class="language-plaintext highlighter-rouge">glibc</code> dynamic linker. I can see only one obstacle here - how to start the application? By default, its LOAD segments are loaded by the kernel, and changing the kernel is something I want to steer clear of. Meanwhile, the dynamic linker is capable of running applications by itself! Have you ever tried to run the dynamic linker as an application? Yes, it’s a DSO indeed, but at the same time it’s runnable too:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ /lib64/ld-linux-x86-64.so.2
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives
in executable files using ELF shared libraries tell the system's program
loader to load the helper program from this file. This helper program loads
the shared libraries needed by the program executable, prepares the program
to run, and runs it. You may invoke this helper program directly from the
command line to load and run an ELF executable file; this is like executing
that file itself, but always uses this helper program from the file you
specified, instead of the helper program file specified in the executable
file you run. This is mostly of use for maintainers to test new versions
of this helper program; chances are you did not intend to run this program.
...
$ /lib64/ld-linux-x86-64.so.2 /bin/echo "HELLO, WORLD"
HELLO, WORLD
</code></pre></div></div>
<p>Advantages of this approach are undeniable:</p>
<ul>
<li>no need to add code to the application or create a separate DSO</li>
<li>huge pages for LOAD segments are available not only for the application but for any other DSO loaded by the dynamic linker</li>
<li>loading a DSO to huge pages becomes dynamic: the same code is invoked in both cases - at application start and inside a <code class="language-plaintext highlighter-rouge">dlopen</code> call.</li>
</ul>
<p>An attempt to create a “dirty” patch for our local <code class="language-plaintext highlighter-rouge">glibc</code> fork revealed only one nasty feature - excessive memory consumption. The fact is, ordinary system DSOs have very small LOAD segments. Sometimes even a 4K page is superfluous for them, let alone a 2M page. Moreover, each system DSO has several LOAD segments inside (remember about security). As a result, too much memory is wasted. For most ordinary system DSOs the default 4K pages are a nice fit, and standard 4K TLB entries do the job perfectly. That’s why the dynamic linker needs a special filter, for example an environment variable listing the DSOs which should be put on huge pages along with the application itself.</p>
<p>Well, if I get free time, I’ll finish my work on the dynamic linker and tell you more about my adventures in the <code class="language-plaintext highlighter-rouge">glibc</code> community. People say that contributing a patch to the <code class="language-plaintext highlighter-rouge">glibc</code> main source tree is a nontrivial and extremely hard task.</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I’d like to say many thanks to the Cloud DBS team in <a href="https://career.huawei.ru/rri/">Huawei Russian Research Institute</a>, which took a great part in active design, research and code review.</p>
<h1 id="to-the-reader">To the reader</h1>
<p>If you have comments or observations, you’ve found a clear error or typo, there’s a missing reference, or a copyright is violated, please leave a note or notify me by any available means; I’d be happy to fix, refine or extend the article.</p>
<p>Source code, forged with blood and sweat, is published <a href="https://github.com/dmitriy-philimonov/elfremapper">here</a>.</p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://wiki.debian.org/Hugepages">https://wiki.debian.org/Hugepages</a></li>
<li><a href="https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt">https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt</a></li>
<li><a href="https://www.1024cores.net/home/in-russian/ram---ne-ram-ili-cache-conscious-data-structures">https://www.1024cores.net/home/in-russian/ram---ne-ram-ili-cache-conscious-data-structures</a></li>
<li><a href="https://medium.com/applied/applied-c-memory-latency-d05a42fe354e">https://medium.com/applied/applied-c-memory-latency-d05a42fe354e</a></li>
<li><a href="https://yandex.ru/images">https://yandex.ru/images</a></li>
<li><a href="https://e7.pngegg.com/pngimages/908/632/png-clipart-man-wearing-black-jacket-illustration-morpheus-the-matrix-neo-red-pill-and-blue-pill-youtube-good-pills-will-play-fictional-character-film.png">https://e7.pngegg.com/pngimages/908/632/png-clipart-man-wearing-black-jacket-illustration-morpheus-the-matrix-neo-red-pill-and-blue-pill-youtube-good-pills-will-play-fictional-character-film.png</a></li>
<li><a href="https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/">https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/</a></li>
<li><a href="https://www.percona.com/blog/2019/03/06/settling-the-myth-of-transparent-hugepages-for-databases/">https://www.percona.com/blog/2019/03/06/settling-the-myth-of-transparent-hugepages-for-databases/</a></li>
<li><a href="https://bugs.mysql.com/bug.php?id=101369">https://bugs.mysql.com/bug.php?id=101369</a></li>
<li><a href="https://jira.mariadb.org/browse/MDEV-24051">https://jira.mariadb.org/browse/MDEV-24051</a></li>
<li><a href="https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html">https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html</a></li>
<li><a href="https://man7.org/linux/man-pages/man7/vdso.7.html">https://man7.org/linux/man-pages/man7/vdso.7.html</a></li>
<li><a href="https://man7.org/linux/man-pages/man3/dl_iterate_phdr.3.html">https://man7.org/linux/man-pages/man3/dl_iterate_phdr.3.html</a></li>
<li><a href="https://github.com/dmitriy-philimonov/elfremapper">https://github.com/dmitriy-philimonov/elfremapper</a></li>
<li><a href="https://github.com/akopytov/sysbench">https://github.com/akopytov/sysbench</a></li>
<li><a href="https://i1.wp.com/freethoughtblogs.com/affinity/files/2016/06/facepalm_estatua.jpg">https://i1.wp.com/freethoughtblogs.com/affinity/files/2016/06/facepalm_estatua.jpg</a></li>
<li><a href="https://pbs.twimg.com/media/EroZF0DXYAIDYI4.jpg">https://pbs.twimg.com/media/EroZF0DXYAIDYI4.jpg</a></li>
<li><a href="https://i.ytimg.com/vi/6K8hc4aFwCg/maxresdefault.jpg">https://i.ytimg.com/vi/6K8hc4aFwCg/maxresdefault.jpg</a></li>
</ul>Dmitriy PhilimonovA story about using huge pages to boost MySQL performanceAdaptive Thread Pool: Improving MySQL Scalability With AI2021-12-28T15:00:00+03:002021-12-28T15:00:00+03:00https://mysqlperf.github.io/mysql/adaptive-thread-pool<p>In our <a href="/mysql/simulation-of-threadpool-for-database-server/">previous blog post</a> we discussed the purpose of a thread pool, various approaches to implementing a thread pool, along with a simulation model describing thread pool implementations in MariaDB and Percona Server. In this post we will look into another methodology of tuning the thread pool size, namely the adaptive Hill Climbing algorithm.</p>
<p>The similarity and difference between the two approaches can be characterized as follows. In the previous post, the output for a given $\mathrm{TPSize}$ was calculated by an invocation of the model, where the model is a standalone program. In this post, the output for a given $\mathrm{TPSize}$ is taken from a running database server in real time and is immediately used for further tuning of the thread pool. So the main difference can be described as offline versus online optimization.</p>
<h2 id="preface">Preface</h2>
<p>The problem of choosing the optimal thread pool size has remained relevant and important over the past few decades. The main goal of such optimization is maximizing throughput on the one hand and minimizing resource consumption on the other. Adaptive solutions to this problem have been under active development in recent years. This blog post gives an overview of a successful application of those solutions to the MySQL thread pool, which, however, does not limit their generality for other software systems. The thread pool implementations in MariaDB and Percona Server were taken as a basis. The well-known Hill Climbing algorithm is used, while background and signal processing procedures produce the data for taking decisions. This post contains several algorithmic heuristics and refinements needed to get realistic results for different types of workloads, distinguishing whether the workload is heavy or light and CPU-bound or IO-bound. It is shown that the performance improvement reaches more than 40% due to the optimal choice of the thread pool size with the adaptive approach. Finally, we provide proposals on further application of methods based on AI and machine learning for multi-dimensional optimizations.</p>
<p>The optimal value of $\mathrm{TPSize}$ depends on multiple factors in a complex way:</p>
<ul>
<li>Number of client requests</li>
<li>Number of CPU cores</li>
<li>Amount of memory</li>
<li>Response time (request duration)</li>
</ul>
<p>So how do you choose an optimal $\mathrm{TPSize}$? Guide <a href="#15">[15]</a> suggests the following:</p>
<blockquote>
<p>The size computation takes into account the number of client requests to be processed concurrently, the resource (number of CPUs and amount of memory) available on the machine and the response times required for processing the client requests. Setting the size to a very small value can affect the ability of the server to process requests concurrently, thus affecting the response times since requests will sit longer in the task queue. On the other hand, having a large number of worker threads to service requests can also be detrimental because they consume system resources, which increases concurrency. This can mean that threads take longer to acquire shared structures, thus affecting response times.</p>
</blockquote>
<p>There are simple expressions, repeated in many works (for example <a href="#11">[11]</a> and <a href="#16">[16]</a>), which give an intuitively clear way to choose a rather good approximation in many cases. The first one is:</p>
\[N_{threads} = N_{cores}\cdot(1 + \frac{W}{S})\]
<p>where</p>
<ul>
<li>$N_{threads}$ – number of threads;</li>
<li>$N_{cores}$ – number of available cores;</li>
<li>$W$ – waiting time, that is the time spent waiting for IO bound tasks to complete;</li>
<li>$S$ – service time, that is the time spent being busy;</li>
<li>$\frac{W}{S}$ – ratio, that is often called the blocking coefficient.</li>
</ul>
<p>The second one uses a fundamental result from queuing theory. Little’s law says that the number of requests in a system equals the rate at which they arrive multiplied by the average amount of time it takes to service an individual request: $N = \lambda \cdot W$.
We can use Little’s law to determine the thread pool size: all we have to do is measure the rate at which requests arrive and the average amount of time to service them, then plug those measurements into Little’s law to calculate the average number of requests in the system.</p>
<p>These assumptions were good enough for small software systems, but they are insufficient at the level of large industrial applications. Indeed, although they require collecting dynamic run-time information, they cannot be used to maximize throughput, because they know nothing about it. Nothing in them guarantees that we get an extremum point of the “concurrency level – throughput” dependency, and the possible performance loss can be notable. But what do we know about that dependency? Why can we say that the task of looking for an optimum point is sensible? Why can too few threads be a bad choice, and vice versa, why can too many threads be a bad choice as well? How does that dependency behave in general? The following table illustrates the answer.</p>
<table>
<tbody>
<tr>
<td><img src="/assets/images/adaptive_thread_pool/5.png" alt="5" style="width: 100%;" /> <strong>Too few threads:</strong> Reduce the ability of the server to process requests concurrently, thus affecting the response times since requests will sit longer in the task queue. <em>CPU is free, but there are no threads to utilize it</em>.</td>
<td><img src="/assets/images/adaptive_thread_pool/6.jpg" alt="6" style="width: 100%;" /> <strong>Too many threads:</strong> Firstly, the overhead of context switching. Secondly, threads compete for system resources, thus taking longer to acquire shared structures and affecting response times.</td>
</tr>
</tbody>
</table>
<p>We can see the two typical patterns of the learning dependency on figures 1 and 2.</p>
<table>
<tbody>
<tr>
<td>Figure 1: inherent to high connections (heavy load)</td>
<td>Figure 2: inherent to low connections (light load)</td>
</tr>
<tr>
<td><img src="/assets/images/adaptive_thread_pool/7.png" alt="7" /> First grows (threads are utilizing CPUs), then falls (threads interfere with each other).</td>
<td><img src="/assets/images/adaptive_thread_pool/8.jpg" alt="8" /> First grows (threads are utilizing as many CPUs as the light load allows), then constant from the “knee” point (no work items for other threads, they are idle and useless).</td>
</tr>
</tbody>
</table>
<p>A good example of how <strong>manually</strong> adjusting the thread pool size may improve performance in some workloads while degrading it in others can be found in <a href="#3">[3]</a>. That’s why it is important for the thread pool to be able to adapt itself to changes in concurrency and workloads.</p>
<h2 id="adaptive-approach">Adaptive Approach</h2>
<p>So how do we exploit that knowledge? More precisely, what would be an effective approach to dynamically adapting $\mathrm{TPSize}$ in order to increase throughput while minimizing the number of threads used? An adaptive approach based on the Hill Climbing <a href="#9">[9]</a> optimization method will help us solve this problem. If you have never heard of it before, below is a brief explanation.</p>
<h3 id="hill-climbing-in-general-what-is-it">Hill Climbing In General: What Is It?</h3>
<p>The Hill Climbing method is an optimization technique that builds a search trajectory in the search space until it reaches a local optimum. It can be considered a general class of heuristic optimization algorithms that deal with the following optimization problem.
There is a finite set $X$ of possible configurations. Each configuration is assigned a non-negative real number called <em>cost</em>; in other words, a <em>cost function</em> is defined: $f : X \rightarrow R$. For each configuration $x \in X$, a set of neighbors $\eta(x) \subset X$ is defined. Let’s assume without loss of generality that our goal is to maximize the cost function. The aim of the search is to find $x_{max} \in X$ maximizing the cost function, $f(x_{max})=\max\{f(x) : x \in X\}$, by moving from one neighbor to another depending on the cost difference between the neighboring configurations.</p>
<p><img src="/assets/images/adaptive_thread_pool/10.png" alt="10" class="align-center" /></p>
<p>Let us list the typical steps of Hill Climbing in more detail. Let’s assume for generality that the configuration is not a scalar but a vector, so we optimize a vector value.</p>
<ul>
<li><strong><em>Step 1</em></strong>. <em>Initialization of the algorithm</em>. Randomly create one candidate solution $\overrightarrow{x_{0}}$ of the required length of $\overrightarrow{x}$.</li>
<li><strong><em>Step 2</em></strong>. <em>Evaluation</em>. Use the cost function $f(\overrightarrow{x_{0}})$ to evaluate the current solution. The first iteration is as follows:</li>
</ul>
\[\overrightarrow{x_{*}} = \overrightarrow{x_{0}}, f_{max}=f(\overrightarrow{x_{*}})\]
<ul>
<li><strong><em>Step 3</em></strong>. <em>Mutation</em>. Mutate the current solution $\overrightarrow{x_{*}}$ by one component and evaluate the new solution $\overrightarrow{x_{i}}$.</li>
<li><strong><em>Step 4</em></strong>. <em>Selection</em>. If the value of the cost function for the new solution is better than for the current solution, replace as follows:</li>
</ul>
\[f(\overrightarrow{x_{i}}) > f(\overrightarrow{x_{*}}) \iff \overrightarrow{x_{*}} =\overrightarrow{x_{i}}\]
<ul>
<li><strong><em>Step 5</em></strong>. <em>Termination</em>. Terminate when there is no improvement in the cost function after a few iterations.</li>
</ul>
<p>The key step which determines the variety of Hill Climbing heuristics is step 3, which is essentially <em>mutate the current solution</em>. How exactly to mutate? It depends on what a researcher has thought up, proven and proposed. And as we will see further in this post when examining our problem, step 4 is not quite deterministic either.</p>
<p class="notice--info"><strong>Note a very important limitation of Hill Climbing: it converges to the nearest (as a rule) local optimum by its nature; it cannot be applied to a global optimum search. That is why the most appropriate application area for it are convex and concave cost functions.</strong></p>
<div class="notice--primary">
<p>The state of the art of the Hill Climbing family of algorithms as of the mid-1990s can be found in the fundamental work <a href="#12">[12]</a>. Theoretical development of the basic algorithms has continued in the current century. A special stochastic version of Hill Climbing was proposed in <a href="#2">[2]</a> to overcome the problem of getting stuck in a local optimum. <a href="#10">[10]</a> extends the application area of Hill Climbing to problems such as the hierarchical composition problem, in order to choose the most appropriate neighbors for building blocks. A noticeable innovation was proposed in <a href="#23">[23]</a>. The main idea is that not only the search direction of the mutation is chosen randomly, but the subject area itself where we look for something is probabilistic. Thus, we improve the current solution not with probability one! Perhaps, with proper consideration, this approach could be used for our problem too.</p>
<p>The variety of applied problems to which Hill Climbing has been applied is very wide. Let’s note some interesting papers. <a href="#17">[17]</a> is devoted to the Graph Drawing problem. It addresses the problem of finding a representation of a graph that satisfies a given aesthetic objective, for example, embedding of its nodes in a target grid. The educational paper <a href="#20">[20]</a> considers Hill Climbing with respect to such well-known discrete optimization tasks as scheduling with constraints (do not conflict on class-room access, on the time of pupils’ groups, on the time of lecturers); the eight queens problem; the traveling salesperson problem and others. It has been shown that Hill Climbing provides some advantages compared to the more classical methods, for example, a limited amount of memory (because only the current state is stored) and ease of implementation. <a href="#4">[4]</a> applies Hill Climbing to cryptanalysis, in particular, to the problem of recovering the internal state of the Trivium cipher. <a href="#8">[8]</a> considers permutation-based combinatorial optimization problems, such as the Linear Ordering problem and the Quadratic Assignment problem. <a href="#1">[1]</a> studies the problem of cluster analysis of Internet pages: how to map two or more pages onto the same cluster. The authors solved two dualistic tasks: finding the minimum distance between each document in the dataset and the cluster centroids, and maximizing the similarity between each document and the cluster centroids. Finally, <a href="#14">[14]</a> gives an example of Hill Climbing for the continuous multi-dimensional problem of PID (Proportional-Integral-Derivative) tuning, where the aim is to tune the controller when the control loop’s process value makes significant excursions from the set point. The features of this task are more than one dependent variable and a very large search space.</p>
</div>
<p>Going back to thread pools, what is a configuration and what is a cost function for them? The configuration consists of one value which is $\mathrm{TPSize}$. The cost function returns the average throughput over a given time period with a given $\mathrm{TPSize}$, expressed for the database server in transactions per second (TPS). We can already describe our general plan in the following way: we observe and measure over time the changes in throughput as a result of adding or removing threads, then decide whether to add or remove more threads based on the observed throughput degradation or improvement. But how to do that?</p>
<p>First of all, let’s note two fundamental aspects of our subject area:</p>
<ul>
<li><em>the cost function is not exactly defined</em>. As a database server is a very complex system influenced by many varying factors, two values measured over two different time periods will never match. We can only say whether the difference is statistically significant or not. This is a bad aspect;</li>
<li><em>the cost function is concave</em> (see figures 1 and 2 again). That is why Hill Climbing can be applied in principle, and it only remains to decide how to apply it. This is a good aspect.</li>
</ul>
<p>To make the decision, we have to remember about our goals:</p>
<ul>
<li>
<dl>
<dt>Primary goals:</dt>
<dd>
<ul>
<li>maximize throughput measured in completed transactions per second;</li>
</ul>
</dd>
<dd>
<ul>
<li>minimize thread pool size for pattern 2;</li>
</ul>
</dd>
<dd>
<ul>
<li>ensure convergence for both patterns from any initial value of the thread pool size;</li>
</ul>
</dd>
</dl>
</li>
<li>
<dl>
<dt>Secondary goals:</dt>
<dd>
<ul>
<li>detect a significant change in the workload and reset the iteration process;</li>
</ul>
</dd>
<dd>
<ul>
<li>minimize the convergence time;</li>
</ul>
</dd>
<dd>
<ul>
<li>minimize the overhead of a dynamic thread pool resizing;</li>
</ul>
</dd>
<dd>
<ul>
<li>make implementation configurable by the user with meaningful parameters.</li>
</ul>
</dd>
</dl>
</li>
</ul>
<p>It should be noted that the most significant results in the adaptive thread pool approach over the last decade were obtained by Microsoft researchers. We will refer mostly to that work below.</p>
<h3 id="hill-climbing-and-thread-pool-control-theory">Hill Climbing and Thread Pool: Control Theory</h3>
<p>In this subsection we consider the technique described in papers <a href="#6">[6]</a> and <a href="#7">[7]</a>. It resembles the gradient descent method <a href="#18">[18]</a>, although strictly speaking it is not. The fact is that one iteration of gradient descent changes all components of the configuration vector, whereas Hill Climbing changes only one of them in accordance with the chosen direction. But since the configuration for the thread pool problem is a single value, we can in fact treat this control theory technique as a gradient descent method after all. The iterative procedure (mutation) is the following:</p>
\[x_{k+1} = x_{k} + sign(\Delta_{km})\lceil a_{km}|\Delta_{km}| \rceil \\
\Delta_{km} = \frac{\overline{y}_{km}-\overline{y}_{k-1}}{x_{km}-x_{k-1}} \\
a_{km} = e^{-s_{km}}\frac{g}{\sqrt{k+1}}\]
<p>where</p>
<ul>
<li>$\overline{y}_{k-1}$ is the value of throughput (cost function) calculated at $x_{k-1}$;</li>
<li>$\overline{y}_{km}$ is the value of throughput (cost function) averaged over $m$ sequential calculations at $x_{k}$;</li>
<li>$s_{km}$ is the standard deviation of the sample mean of throughput values collected at $x_{k}$;</li>
<li>$g$ is the control gain, the default value is 5.</li>
</ul>
<p>The term $\Delta_{km}$ can be thought of as a “gradient”. Results, usage experience and practical suggestions are described in <a href="#22">[22]</a>.</p>
<p>The main shortcoming of this method is that the measurements are noisy (Fig. 3), and the method does not handle that very well. The noise makes the statistical information unrepresentative of the actual situation unless it is collected over a large time interval, which is also unacceptable in practice.</p>
<p><img src="/assets/images/adaptive_thread_pool/11.png" alt="11" />
<strong>Figure 3:</strong> a noisy cost function: the thread pool size is constant, but throughput fluctuates</p>
<p><a href="#5">[5]</a> says the following about this method:</p>
<blockquote>
<p>Its use was particularly problematic because of the difficulty in detecting small variations or extracting changes from a very noisy environment over a short time. The first problem observed with this approach is that the modeled function isn’t a static target in real-world situations, so measuring small changes is hard. The next issue, perhaps more concerning, is that noise (variations in measurements caused by the system environment, such as certain OS activity, garbage collection and more) makes it difficult to tell if there’s a relationship between the input and the output, that is, to tell if the throughput isn’t just a function of the number of threads. In fact, in the thread pool, the throughput constitutes only a small part of what the real observed output is—most of it is noise.</p>
</blockquote>
<p>That is why iterations in this method often turn into random walks and do not bring performance any closer to the optimum point.</p>
<p>In this regard, it is worth mentioning the paper <a href="#21">[21]</a>, where a simplified version of the described approach is proposed. The diagram of state transitions (Fig. 4) gives an idea of it.</p>
<p style="text-align: center;"><a href="/assets/images/adaptive_thread_pool/12.png" title="Figure 4: A simplified version of the control theory approach"><img src="/assets/images/adaptive_thread_pool/12.png" alt="12" /></a>
Figure 4: A simplified version of the control theory approach</p>
<p>In theory, this approach is reasonable, but in practice it suffers even more than the previous one from the same shortcomings due to the mutation and selection algorithms being too primitive. Let’s move on to the next methodology.</p>
<h3 id="hill-climbing-and-thread-pool-signal-processing">Hill Climbing and Thread Pool: Signal Processing</h3>
<p>This approach is briefly described in <a href="#5">[5]</a> without details. The key idea is that we treat the input (concurrency level) and output (throughput) as signals. If we input a purposely modified concurrency level as a “wave” with known period and amplitude, and then look for that original wave pattern in the output, we can separate noise from the actual effect of the input on throughput. We introduce a signal and then try to find it in the noisy output. This effect can be achieved by using techniques generally used for extracting waves from other waves or finding specific signals in the output. This also means that by introducing changes to the input, the algorithm is making decisions at every point based on the last small piece of input data. The algorithm uses a discrete Fourier transform, a methodology that gives information such as the magnitude and the phase of a wave. This information can then be used to see if and how the input affected the output.</p>
<p>Let’s describe the basic decisive idea of the signal processing technique. There are two data rows (<em>waves</em>): a sequence of $\mathrm{TPSize}$ values and the corresponding sequence of throughput values. So we have already performed the <em>Mutation</em> step of Hill Climbing by varying $\mathrm{TPSize}$. Now it is time for the <em>Selection</em> step. We have to figure out the common trend: <strong>does throughput improve or degrade with increasing $\mathrm{TPSize}$?</strong> In the former case we will move one step up to increase $\mathrm{TPSize}$; in the latter we will move one step down to decrease it.</p>
<p>We calculate the first Fourier harmonic of the first row and the first Fourier harmonic of the second row; both are complex numbers:</p>
\[c_{1} = \rho_{1} (\cos \varphi_{1} + i \sin \varphi_{1}) \\
c_{2} = \rho_{2} (\cos \varphi_{2} + i \sin \varphi_{2})\]
<p>Then calculate the real part of the ratio $\frac{c_{1}}{c_{2}}$:</p>
\[Re(\frac{c_{1}}{c_{2}}) = \frac{\rho_{1}}{\rho_{2}} \cos(\varphi_{1} - \varphi_{2})\]
<p>We then look at the sign of the ratio. A positive ratio means that $|\varphi_{1} - \varphi_{2}| < \frac{\pi}{2}$, so the two data rows oscillate in phase and have the same trend. In that case we increase $\mathrm{TPSize}$. A negative ratio means that $|\varphi_{1} - \varphi_{2}| > \frac{\pi}{2}$, so the two data rows oscillate in antiphase: an increase of the first corresponds to a decrease of the second. In this case we decrease $\mathrm{TPSize}$.</p>
<p class="notice"><strong>To sum up, the direction of $\mathrm{TPSize}$ adjustment is determined by the sign of the real part of the first harmonics’ ratio.</strong></p>
<p>Our approach uses signal processing and it is loosely based on the open source code of .NET <a href="#25">[25]</a>. That original code is sketchy and useless in practice for real server systems.</p>
<h2 id="our-solution">Our solution</h2>
<p>The “true” behavior of the cost function in production database servers is hidden from a superficial view. Not to mention that the function itself changes in one way or another due to changes in the workload and the OS environment. That is why, in order to get a useful and practically applicable solution, we must identify and resolve many important issues, mainly to reject artifacts that are not caused by the $\mathrm{TPSize}$ adjustment. Let’s start from our states and transitions (Fig. 5).</p>
<p style="text-align: center;"><a href="/assets/images/adaptive_thread_pool/13.png" title="Figure 5: states and transitions"><img src="/assets/images/adaptive_thread_pool/13.png" alt="13" class="align-center" /></a>
Figure 5: states and transitions</p>
<ul>
<li><em>Usual</em> – iterations belonging to either increasing or decreasing parts of the curve to get closer to the optimal point;</li>
<li><em>Plateau</em> – iterations belonging to the constant part of the curve to get closer to the “knee” point;</li>
<li><em>Optimized</em> – no iterations, because we are at the optimal point. Check if the input has changed, and if so, reinitialize the search.</li>
</ul>
<p>We can see that transitions are possible between arbitrarily ordered state pairs, so the graph is fully connected. It is also important to note that the algorithms and conditions of the transitions are the engine of our solution, as are the signal processing formulas.</p>
<p>The architecture of our adaptive thread pool module as a function call graph is shown on Fig. 6.</p>
<p style="text-align: center;"><a href="/assets/images/adaptive_thread_pool/14.png" title="Figure 6: adaptive thread pool module: functions call graph"><img src="/assets/images/adaptive_thread_pool/14.png" alt="14" /></a>
Figure 6: the adaptive thread pool module: the function call graph</p>
<p>Explanations of some important functions are:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">init()</code> – initialization of the Hill Climbing class properties, memory allocation of inner data structures;</li>
<li><code class="language-plaintext highlighter-rouge">update()</code> – implements the main iterative procedure and calls all auxiliary algorithms;</li>
<li><code class="language-plaintext highlighter-rouge">dump_log()</code> – prints debug information to a file in a structured way, can be disabled by a system variable;</li>
<li><code class="language-plaintext highlighter-rouge">reinit()</code> – restart the search of the optimal thread pool size when some external conditions are changed. <code class="language-plaintext highlighter-rouge">reinit()</code> cases are:
<ul>
<li>when the caller is <code class="language-plaintext highlighter-rouge">update_on_optimized_stage()</code>: an exit from the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode;</li>
<li>when the caller is <code class="language-plaintext highlighter-rouge">estimate_real_progress()</code>: a false transition to the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode;</li>
<li>when the caller is <code class="language-plaintext highlighter-rouge">optimize()</code>: a false transition to the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode;</li>
<li>when the caller is <code class="language-plaintext highlighter-rouge">try_data_trace()</code>: a workload change is detected, the curve has changed;</li>
<li>when the caller is <code class="language-plaintext highlighter-rouge">update()</code>:
<ul>
<li>the thread pool size has been changed manually by a user;</li>
<li>no connections;</li>
<li>an unnaturally steep decline of throughput.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="our-own-algorithm-customizations">Our own algorithm customizations</h3>
<p>In this subsection we list some heuristics and mini-solutions that help to reach the declared goals on the one hand, and to compensate some computational artifacts of Hill Climbing with respect to the thread pool on the other hand.</p>
<h4 id="false-transition-to-the-optimized-mode">False transition to the OPTIMIZED mode</h4>
<p>The condition is: \(\mathrm{TPSize_{new}} - N_{connections} > 20\) (a configurable parameter). It makes no sense to have more threads than the number of concurrent connections. A small excess is allowed, but it should not be higher than a certain value.</p>
<h4 id="false-jump">False jump</h4>
<p>The condition is: $\frac{N_{connections}}{\mathrm{TPSize_{new}}} < 0.2$ (a configurable parameter). The same logic as in the previous item is used. If the adjusted $\mathrm{TPSize}$ value exceeds the number of connections by a certain margin, we reject it.</p>
<h4 id="flexible-step-adjustment-in-the-plateau-mode">Flexible step adjustment in the <code class="language-plaintext highlighter-rouge">PLATEAU</code> mode</h4>
<p>This feature is applied when we move from right to left in the PLATEAU mode to find the “knee” point. The original rule $\mathrm{TPSize_{new}} = \frac{\mathrm{TPSize_{old}}}{2}$ is too coarse and results in large deviations to the left of the knee point. We have to provide a smaller downward jump for lower $\mathrm{TPSize}$ values.</p>
<p>Let’s consider $2N$ reference points. Then we can propose the following formula:</p>
\[\mathrm{TPSize_{new}} = \mathrm{TPSize_{old}}\cdot(0.5 + \sum_{i=1}^{N} k_{2i-1}\cdot e^{-k_{2i}\cdot\mathrm{TPSize_{old}}})\]
<p>For $N=1$, we chose two points \((10; 8)\) and \((20; 15)\) and solved the resulting system of equations to get $k_{1}=0.36$ and $k_{2}=0.018$. So the smaller the value of $\mathrm{TPSize_{old}}$, the larger the distance between $\mathrm{TPSize_{new}}$ and $\frac{\mathrm{TPSize_{old}}}{2}$.</p>
<h4 id="workload-change-detection">Workload change detection</h4>
<p>As we mentioned above, one of the reasons for re-initializing the model is “a workload change is detected, the curve has changed”. But how do we detect that? In other words, how does our approach perform with heterogeneous workloads and a changing request submission rate? Sometimes it’s hard to tell whether an improvement was the result of a change in concurrency or of another factor such as workload fluctuations. That is why an improvement observed in a time interval may not even be related to the change in concurrency level (figure 7 illustrates this issue).</p>
<figure class="align-center">
<img src="/assets/images/adaptive_thread_pool/15.png" alt="" />
</figure>
<figure class="align-center">
<img src="/assets/images/adaptive_thread_pool/16.png" alt="" />
<figcaption>Figure 7: constant $\mathrm{TPSize}$ and growing throughput</figcaption>
</figure>
<p>The idea is simple. While the algorithm does its job, we store the calculated pairs $(\mathrm{TPSize}; \mathrm{throughput})$. When the first element of the next pair falls between two already known neighbor points, we predict the second element using some interpolation method, for example, even simple linear interpolation. The heuristic we apply here is: if the actual throughput is too far from the predicted one, and this situation has repeated twice, then the cost function has changed and we have to re-initialize the algorithm.
Fig. 8 illustrates this feature.</p>
<figure class="align-center">
<img src="/assets/images/adaptive_thread_pool/17.jpg" alt="" />
<figcaption>Figure 8: detection that cost function has significantly changed</figcaption>
</figure>
<h4 id="entering-and-exiting-the-optimized-mode">Entering and exiting the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode</h4>
<p>Steps to enter and exit the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode are the following:</p>
<ol>
<li>Fix the throughput, the number of connections and the average request latency when we enter the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode.</li>
<li>Check these values on each time interval (in the <code class="language-plaintext highlighter-rouge">update()</code> function call).</li>
<li>If one of them has deviated by more than a certain threshold, defined by a configuration parameter, the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode is no longer applicable.</li>
</ol>
<h4 id="overhead-of-the-dynamic-thread-pool-resizing">Overhead of the dynamic thread pool resizing</h4>
<p>Thanks to the features of the MySQL thread pool implementation, this is not a problem. The thread pool consists of a number of thread groups, and that number is exactly $\mathrm{TPSize}$. Among other fields, each thread group structure contains the following:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">pollfd</code>, which is a file descriptor for listening events with the <code class="language-plaintext highlighter-rouge">io_poll_wait()</code> API and extraction of input requests;</li>
<li><code class="language-plaintext highlighter-rouge">mutex</code> to protect group fields from concurrent access.</li>
</ul>
<p>Thus, when we increase $\mathrm{TPSize}$, all we need to do is create the missing file descriptors (if any); when we decrease $\mathrm{TPSize}$, there is nothing to do at all. Listing 1 illustrates that.</p>
<p style="text-align: center; font-size: 0.7em;">Listing 1.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">set_threadpool_size</span><span class="p">(</span><span class="n">uint</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">threadpool_started</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">success</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="n">uint</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">thread_group_t</span> <span class="o">*</span><span class="n">group</span> <span class="o">=</span> <span class="o">&</span><span class="n">all_groups</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">mutex_lock</span><span class="p">(</span><span class="o">&</span><span class="n">group</span><span class="o">-></span><span class="n">mutex</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">group</span><span class="o">-></span><span class="n">pollfd</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">group</span><span class="o">-></span><span class="n">pollfd</span> <span class="o">=</span> <span class="n">io_poll_create</span><span class="p">();</span>
<span class="n">success</span> <span class="o">=</span> <span class="p">(</span><span class="n">group</span><span class="o">-></span><span class="n">pollfd</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">success</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/*some message to log*/</span>
<span class="n">mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">all_groups</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">mutex</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">mutex_unlock</span><span class="p">(</span><span class="o">&</span><span class="n">all_groups</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">mutex</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">success</span><span class="p">)</span> <span class="n">group_count</span> <span class="o">=</span> <span class="n">size</span><span class="p">;</span>
<span class="k">else</span> <span class="n">group_count</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="jumping-down">Jumping down</h4>
<p>The original signal processing variant of Hill Climbing for the adaptive thread pool can oscillate only forward from the current $\mathrm{TPSize}$ value, not backwards. That is why, if the spectral analysis of the two waves has shown an improvement, there are no questions about the <em>Selection</em> step: we just add the current magnitude to the current $\mathrm{TPSize}$ value, since that value has proven to be better. But what about degradation? It is clear that $\mathrm{TPSize}$ needs to be decreased, but by how much? This question can be answered only with some plausible heuristics. The simplest and most natural idea is to make the decrement proportional to the absolute value of the real part of the first harmonics’ ratio. For scaling, we can use the previously obtained correspondences between already made forward jumps and their real parts.</p>
<h3 id="user-configurable-parameters-for-adaptive-thread-pool">User configurable parameters for adaptive thread pool</h3>
<p>For experimental purposes we introduced 33 new parameters to fine-tune the Hill Climbing iteration process and decision making. Most of them will never need to be changed in production. The most important ones, for tuning or debugging the module in rare cases, are:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">hcm_hillclimbing_enabled</code> – switch on/off the adaptive thread pool module. Default is false;</li>
<li><code class="language-plaintext highlighter-rouge">hcm_log_enabled</code> – switch on/off log file dumping. Default is false;</li>
<li><code class="language-plaintext highlighter-rouge">wave_period</code> – period of forced oscillation of the thread pool size. Default is 4;</li>
<li><code class="language-plaintext highlighter-rouge">samples_to_wave_period_ratio</code> – defines the history size of the previous thread pool size and throughput values, which are taken for consideration by the adaptive thread pool engine. Default is 8, so we take two vectors with 32 values each;</li>
<li><code class="language-plaintext highlighter-rouge">hcm_period</code> – the time interval in seconds between two sequential calls of the <code class="language-plaintext highlighter-rouge">update()</code> function. In other words, it is the sampling interval for the adaptive algorithm. Default value is 2;</li>
<li><code class="language-plaintext highlighter-rouge">min_accepted_throughput</code> – if the current throughput has dropped below this threshold, we suspend the adaptive thread pool module as using it is not practical. Default value is 300;</li>
<li><code class="language-plaintext highlighter-rouge">hcm_eps</code> – accuracy of the optimal value search. The optimal concurrency is considered found if the distance between the current upper and lower boundaries has become less than <code class="language-plaintext highlighter-rouge">hcm_eps</code>. Default is 5;</li>
<li><code class="language-plaintext highlighter-rouge">hcm_valuable_diff</code> – the maximum deviation (in percent) of one value from another (either events per second or average latency). Default is 20%. Used to exit the <code class="language-plaintext highlighter-rouge">OPTIMIZED</code> mode or to detect workload changes;</li>
<li><code class="language-plaintext highlighter-rouge">hcm_ccs_valuable_progress</code> – triggers a change in the thread pool size when the absolute accumulated sum of $Re(\frac{c_{1}}{c_{2}})$ reaches this threshold. Default is 0.2;</li>
<li><code class="language-plaintext highlighter-rouge">hcm_max_thread_wave_magnitude</code> – the upper boundary for $\mathrm{TPSize}$ oscillation magnitude, which we gradually increase from <code class="language-plaintext highlighter-rouge">hcm_min_thread_wave_magnitude</code> (default is 10). If the specified magnitude has been reached, we conclude that there is no growing/falling trend and switch to the <code class="language-plaintext highlighter-rouge">PLATEAU</code> mode. Default is 80.</li>
</ul>
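<p>To make the roles of <code class="language-plaintext highlighter-rouge">wave_period</code>, <code class="language-plaintext highlighter-rouge">samples_to_wave_period_ratio</code> and <code class="language-plaintext highlighter-rouge">hcm_ccs_valuable_progress</code> concrete, here is a minimal Python sketch (hypothetical names; the actual engine lives inside the server) of how the transfer ratio $c_{1}/c_{2}$ between the throughput wave and the forced $\mathrm{TPSize}$ wave could be extracted from the sample history with a single-bin DFT. A positive real part suggests throughput grows with the pool size, a negative one suggests the opposite:</p>

```python
import cmath
import math

def transfer_ratio(tp_sizes, throughputs, wave_period):
    """Single-bin discrete Fourier transform at the forced-wave frequency.

    Returns c1/c2, where c1 is the complex amplitude of the throughput
    signal and c2 that of the TPSize signal at period `wave_period`."""
    n = len(tp_sizes)
    w = cmath.exp(-2j * math.pi / wave_period)
    c2 = sum(tp_sizes[i] * w ** i for i in range(n)) / n
    c1 = sum(throughputs[i] * w ** i for i in range(n)) / n
    return c1 / c2 if abs(c2) > 1e-9 else 0j

# In-phase waves: more threads -> more throughput, so Re(c1/c2) > 0.
print(transfer_ratio([11, 10, 9, 10], [110, 100, 90, 100], 4).real)
```

<p>In the module, the absolute accumulated sum of these real parts is compared against <code class="language-plaintext highlighter-rouge">hcm_ccs_valuable_progress</code> to decide when the trend is trustworthy enough to move $\mathrm{TPSize}$.</p>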
<h2 id="testing">Testing</h2>
<p>When it comes to testing our Hill Climbing module, it is logical to check it first on simple deterministic cost functions with a single, explicit maximum point like this one (Fig. 9).</p>
<figure class="align-center">
<img src="/assets/images/adaptive_thread_pool/18.png" alt="" />
<figcaption>Figure 9: example of simple cost function for primary Hill Climbing tests</figcaption>
</figure>
<p>Such tests are needed to eliminate the coarsest bugs and to estimate convergence time. We must convince ourselves that our Hill Climbing procedure converges to the optimum point from any initial value, whether to the right or to the left of it. Until these tests pass, there is no sense in testing on a real database server with a thread pool. The key features of such a simple test suite are the following:</p>
<ul>
<li>an artificially created table defining the cost function, matching one of two patterns; the correct answer is known in advance;</li>
<li>a linear or spline interpolation in intermediate points;</li>
<li>the hill climbing engine as a standalone program, not embedded into a database server;</li>
<li>simple, fast and easy tests.</li>
</ul>
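<p>A standalone test of this kind might look like the following sketch (ours is a separate program embedded in a C++ harness; this Python version with made-up defaults only illustrates convergence from either side of the optimum):</p>

```python
def hill_climb(cost, start, lo, hi, step=8, eps=1):
    """1-D hill climbing: probe both neighbours at the current step,
    move uphill, halve the step when stuck; stop once the step is
    below eps (the role hcm_eps plays as the search accuracy)."""
    x = start
    while step >= eps:
        candidates = [max(lo, x - step), x, min(hi, x + step)]
        best = max(candidates, key=cost)
        if best == x:
            step //= 2          # no better neighbour: refine the search
        else:
            x = best
    return x

# A simple unimodal cost function with its only maximum at 60 (cf. Fig. 9).
cost = lambda tp_size: -(tp_size - 60) ** 2
assert hill_climb(cost, 5, 1, 200) == 60    # converges from the left
assert hill_climb(cost, 150, 1, 200) == 60  # converges from the right
```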
<p>Once the tests on the simple framework pass, it is time to move on to a real database server with the <em>Sysbench</em> framework, developed by the Russian researcher Alexey Kopytov and described in many sources, for example, <a href="#13">[13]</a>.
The goal of this testing is to try the adaptive thread pool on a real MySQL server with a synthetic workload that is close to realistic; to find possible artifacts; to refine auxiliary algorithmic features and finally, to evaluate performance improvements for different types of workloads. Base configuration is the following:</p>
<ul>
<li>HWSQL server 8.0;</li>
<li>10 tables with one million records in each;</li>
<li>different workload profiles such as <em>point select</em>, <em>read only</em> and <em>read write</em>;</li>
<li>different number of connections such as 1, 4, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024;</li>
<li>10-minute test duration for each concurrency level;</li>
<li>the variable <code class="language-plaintext highlighter-rouge">hcm_log_enabled</code> is switched on for subsequent exhaustive log file analysis.</li>
</ul>
<p>The following hardware profile was used for <em>Sysbench</em> tests:</p>
<ul>
<li>Ubuntu 20.04, x86_64, GNU/Linux;</li>
<li>Intel® Xeon® Gold 6151 CPU@3.00GHz;</li>
<li>72 CPUs;</li>
<li>628 GB of memory.</li>
</ul>
<p>It should be noted that the adaptive thread pool does not give a noticeable performance improvement for the <em>sysbench/ps</em> and <em>sysbench/ro</em> workloads, or for CPU-bound workloads in general, although it minimizes $\mathrm{TPSize}$ for them, thus minimizing resource usage. But for the <em>sysbench/rw</em> and <em>sysbench/TPC-C</em> workloads it improves performance by more than 40%. Let’s see it in our results.</p>
<h2 id="results">Results</h2>
<p>Before demonstrating the results of performance experiments let’s illustrate some profiling data, namely, the difference in distribution between Off-CPU time for CPU-bound and IO-bound workloads (Fig. 10, 11).</p>
<table>
<tbody>
<tr>
<td><a href="/assets/images/adaptive_thread_pool/19.jpg" title="Figure 10: off-CPU time for CPU-bound workload"><img src="/assets/images/adaptive_thread_pool/19.jpg" alt="19" /></a> Figure 10: off-CPU time for CPU-bound workload</td>
<td><a href="/assets/images/adaptive_thread_pool/20.jpg" title="Figure 11: off-CPU time for IO-bound workload"><img src="/assets/images/adaptive_thread_pool/20.jpg" alt="20" /></a> Figure 11: off-CPU time for IO-bound workload</td>
</tr>
</tbody>
</table>
<p>Off-CPU time is the time interval between the <code class="language-plaintext highlighter-rouge">threadpool::wait_begin()</code> and <code class="language-plaintext highlighter-rouge">threadpool::wait_end()</code> calls. As we can see by comparing the x-axis data, this time is 15 times higher for the IO-bound workload. An optimal $\mathrm{TPSize}$ much higher than the number of CPUs is typical for IO-bound workloads with longer off-CPU times, while an optimal $\mathrm{TPSize}$ close to the number of CPUs is typical for CPU-bound workloads. This fact is confirmed by the table below and explained in Fig. 12.</p>
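<p>Conceptually, the per-wait intervals behind these histograms can be collected with a trivial wrapper; below is a hedged Python analogue of the instrumentation (the real data comes from timestamps taken around <code class="language-plaintext highlighter-rouge">threadpool::wait_begin()</code> and <code class="language-plaintext highlighter-rouge">threadpool::wait_end()</code> in the server):</p>

```python
import time
from contextlib import contextmanager

@contextmanager
def off_cpu_timer(samples):
    """Record the interval between a wait_begin()/wait_end() pair."""
    begin = time.monotonic()        # analogue of threadpool::wait_begin()
    try:
        yield
    finally:                        # analogue of threadpool::wait_end()
        samples.append(time.monotonic() - begin)

waits = []
with off_cpu_timer(waits):
    time.sleep(0.01)                # the thread is off-CPU here (e.g. disk IO)
```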
<p>For each pair (connections; profile), three 10-minute sysbench runs were launched:</p>
<ol>
<li>With <code class="language-plaintext highlighter-rouge">hillclimbing_enabled=on</code>, the thread pool size is 72 (the initial value). The result is the optimal value of $\mathrm{TPSize}$ (column <em>opt</em>);</li>
<li>With <code class="language-plaintext highlighter-rouge">hillclimbing_enabled=off</code>, the thread pool size is 72 (a constant value). The result is the average throughput in queries per second (column <em>usual</em>);</li>
<li>With <code class="language-plaintext highlighter-rouge">hillclimbing_enabled=off</code>, the thread pool size is <em>opt</em> (a constant value). The result is the average throughput in queries per second (column <em>contr</em>);</li>
<li>Column <em>diff</em> contains the relative difference of the <em>contr</em> column compared to the <em>usual</em> column.</li>
</ol>
<table>
<tbody>
<tr>
<td><a href="/assets/images/adaptive_thread_pool/21.png" title="Table"><img src="/assets/images/adaptive_thread_pool/21.png" alt="21" /></a></td>
<td><a href="/assets/images/adaptive_thread_pool/22.png" title="Figure 12: why extra threads are useful with IO-bound workload"><img src="/assets/images/adaptive_thread_pool/22.png" alt="22" /></a> Figure 12: why extra threads are useful with IO-bound workload</td>
</tr>
</tbody>
</table>
<p>Figures 13 and 14 illustrate performance improvements provided by the adaptive thread pool for the sysbench/rw workload. The number of connections is depicted on the x-axis in both figures.</p>
<table>
<tbody>
<tr>
<td><a href="/assets/images/adaptive_thread_pool/23.png" title="Figure 13: average throughput (transactions per second)"><img src="/assets/images/adaptive_thread_pool/23.png" alt="23" /></a> Figure 13: average throughput (transactions per second)</td>
<td><a href="/assets/images/adaptive_thread_pool/24.png" title="Figure 14: average latency (millisecond)"><img src="/assets/images/adaptive_thread_pool/24.png" alt="24" /></a> Figure 14: average latency (millisecond)</td>
</tr>
</tbody>
</table>
<p>The fact that the performance improvements in these figures are slightly smaller than the ones shown in the table should not be surprising, because this experiment differs from the previous one. The figures are built over 10-minute runs, each of which started from the same initial value and includes the hill climbing convergence time. Thus, the algorithm worked with the optimal $\mathrm{TPSize}$ only part of the time.</p>
<p>Figures 15 and 16 are equivalents of 13 and 14 for the sysbench/tpc-c workload.</p>
<table>
<tbody>
<tr>
<td><a href="/assets/images/adaptive_thread_pool/25.png" title="Figure 15: average throughput (transactions per second)"><img src="/assets/images/adaptive_thread_pool/25.png" alt="25" /></a> Figure 15: average throughput (transactions per second)</td>
<td><a href="/assets/images/adaptive_thread_pool/26.png" title="Figure 16: average latency (millisecond)"><img src="/assets/images/adaptive_thread_pool/26.png" alt="26" /></a> Figure 16: average latency (millisecond)</td>
</tr>
</tbody>
</table>
<p>Figures 17 and 18 show how the optimal $\mathrm{TPSize}$ value depends on the concurrency level for different types of workload. They illustrate a feature of our solution, minimizing $\mathrm{TPSize}$ at low concurrency, which is more visible in the right-hand figure. We can see that even for the sysbench/ps and sysbench/ro workloads the algorithm finds an optimal $\mathrm{TPSize}$ value higher than the number of cores at high concurrency. This is reasonable, because the found optimal values give a small, but still visible, performance increase for those workloads.</p>
<table>
<tbody>
<tr>
<td><a href="/assets/images/adaptive_thread_pool/27.png" title="Figure 17: sysbench/rw and sysbench/tpc-c"><img src="/assets/images/adaptive_thread_pool/27.png" alt="27" /></a> Figure 17: sysbench/rw and sysbench/tpc-c</td>
<td><a href="/assets/images/adaptive_thread_pool/28.png" title="Figure 18: sysbench/ps and sysbench/ro"><img src="/assets/images/adaptive_thread_pool/28.png" alt="28" /></a> Figure 18: sysbench/ps and sysbench/ro</td>
</tr>
</tbody>
</table>
<h2 id="generalization-and-future-work">Generalization And Future Work</h2>
<p>And finally, a few words about the machine learning (ML) approach in databases and how it relates to our solution. This approach has been actively researched in recent years <a href="#24">[24]</a>. The current state is the following. Some separate software module (for example, a thread pool) is configured by several parameters chosen by developers or a DBA. Each of these parameters, as well as their combinations, affects the output, which is some performance measure. Thus, if some tuple of parameter values gives the maximum performance, the goal is to find that optimal tuple.
The optimal tuple varies over a wide range depending on various factors, such as:</p>
<ul>
<li>server’s hardware configuration (number and types of CPU, volume of RAM and swap partition, etc.);</li>
<li>operating system and job scheduling algorithms;</li>
<li>current load from clients in the sense of quantity (concurrency levels);</li>
<li>current load from clients in the sense of types (distribution of request lengths and availability of consumed resources, such as CPU and disk).</li>
</ul>
<p>To find the optimal tuple in this multi-dimensional search space on a given server, the Hill Climbing method can be applied. Any found optimal tuple corresponds to some fixed workload on the given server. If we describe the profile of that workload more or less completely, we can expect that the next time the profile is approximately the same, we will already know the optimal tuple. This idea is the cornerstone of the proposed approach.</p>
<p>The general plan is:</p>
<ul>
<li>profile the code in a proper way and collect workload data when the adaptive algorithm is active;</li>
<li>associate the found optimal tuple with the collected data, thus getting a new record of the training dataset;</li>
<li>when the training dataset becomes large enough, train our ML model on it, then try to apply this model before the adaptive algorithm completes its work.</li>
</ul>
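<p>Under the simplifying assumption that “approximately the same profile” can be judged by a distance over (normalized) profile features, the lookup step of this plan could be sketched as a nearest-neighbour model; all names and numbers below are illustrative, not our actual implementation:</p>

```python
import math

def nearest_tuple(profile, training_set, max_dist=1.0):
    """Return the stored optimal parameter tuple of the closest known
    workload profile, or None if nothing is close enough and the
    adaptive algorithm has to run from scratch."""
    def dist(a, b):
        # Euclidean distance; real features would need normalization first.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(training_set, key=lambda rec: dist(profile, rec["profile"]),
               default=None)
    if best is None or dist(profile, best["profile"]) > max_dist:
        return None
    return best["opt_tuple"]

history = [
    # hypothetical (connections, CPUs, io_ratio) -> optimal tuple records
    {"profile": (64, 72, 0.2), "opt_tuple": (3, 10)},
    {"profile": (512, 72, 0.9), "opt_tuple": (8, 40)},
]
assert nearest_tuple((512, 72, 0.85), history) == (8, 40)
assert nearest_tuple((5000, 8, 0.1), history) is None
```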
<p>The point is that collecting the workload profile data for the ML model takes much less time than waiting for the adaptive algorithm’s iterative process to converge.</p>
<p>According to <a href="#19">[19]</a>, we can perform ML procedures right in the database by means of the proposed SQL extension, so we do not need to resort to external ML tools after extracting the training dataset from the database. It seems that, when implemented, the results of that work will significantly improve the efficiency of the described approach.</p>
<p>Let’s give an example of a thread pool tuple:</p>
<ul>
<li><em>oversubscribe</em> – defines the maximum number of active threads in one group;</li>
<li><em>timer_interval</em> – the time interval between activations of the <code class="language-plaintext highlighter-rouge">Timer</code> thread;</li>
<li><em>queue_put_limit</em> – wake or create a thread in <code class="language-plaintext highlighter-rouge">queue_put()</code>, if the number of active threads in the group is less or equal to this parameter;</li>
<li><em>wake_top_limit</em> – create a new thread in <code class="language-plaintext highlighter-rouge">wake_or_create_thread()</code> only if the number of active threads in the group is less or equal to this parameter;</li>
<li><em>create_thread_on_wait</em> – a boolean parameter defining whether a new thread should be created in <code class="language-plaintext highlighter-rouge">wait_begin()</code>;</li>
<li><em>idle_timeout</em> – the maximum time a thread can be in the idle state;</li>
<li><em>listener_wake_limit</em> – the listener thread wakes up an idle thread, if the number of active threads is less or equal to this parameter;</li>
<li><em>listener_create_limit</em> – the listener thread creates a new thread, if the number of active threads is less or equal to this parameter.</li>
</ul>
<p>And a workload profile may look like this:</p>
<ul>
<li>the number of persistent connections;</li>
<li>the number of CPUs (the return value of <code class="language-plaintext highlighter-rouge">getncpus()</code>);</li>
<li>the latency of a new thread creation (timing of the <code class="language-plaintext highlighter-rouge">create_worker()</code> function);</li>
<li>the number of active rounds in processing of a single request (from the start of the execution to the first <code class="language-plaintext highlighter-rouge">wait_begin()</code> call + all rounds from <code class="language-plaintext highlighter-rouge">wait_end()</code> to <code class="language-plaintext highlighter-rouge">wait_begin()</code> + the round from the last <code class="language-plaintext highlighter-rouge">wait_end()</code> to the end of execution);</li>
<li>the duration of a single active round in a request execution;</li>
<li>the duration of a single wait round in a request execution;</li>
<li>the time interval between the end of a single request execution and the start of the next request execution in <code class="language-plaintext highlighter-rouge">io_poll_wait()</code> for one connection.</li>
</ul>
<h2 id="references">References</h2>
<p><a name="1">1</a>: L. M. Abualigah, E. S. Hanandeh, T. A. Khader, M. Otair, S. K. Shandilya, An Improved $\beta$-hill Climbing Optimization Technique for Solving the Text Documents Clustering Problem. – Current medical imaging reviews, vol. 14(4), 2020, pp.296-306.</p>
<p><a name="2">2</a>: M. A. Al-Betar, $\beta$-Hill climbing: an exploratory local search. – Neural Computing & Applications, vol.28, 2017, pp.153-168.</p>
<p><a name="3">3</a>: K. Bauskar, MariaDB thread pool and NUMA scalability (December 2021). – <a href="https://mysqlonarm.github.io/mdb-tpool-and-numa">https://mysqlonarm.github.io/mdb-tpool-and-numa</a></p>
<p><a name="4">4</a>: J. Borghoff, L. R. Knudsen, K. Matusiewicz, Hill Climbing Algorithms and Trivium. – 17th International Workshop, Selected Areas in Cryptography (SAC) – 2010, Waterloo, Ontario, Canada, August 2010 (Springer, 2011), pp.57-73.</p>
<p><a name="5">5</a>: E. Fuentes, Concurrency – Throttling Concurrency in the CLR 4.0 Threadpool (September 2010). – <a href="https://docs.microsoft.com/en-us/archive/msdn-magazine/2010/September/concurrency-throttling-concurrency-in-the-clr-4-0-threadpool">https://docs.microsoft.com/en-us/archive/msdn-magazine/2010/September/concurrency-throttling-concurrency-in-the-clr-4-0-threadpool</a></p>
<p><a name="6">6</a>: J. L. Hellerstein, V. Morrison, E. Eilebrecht, Applying Control Theory in the Real World. – ACM’SIGMETRICS Performance Evaluation Rev., Volume 37, Issue 3, 2009, pp.38-42. doi: 10.1145/1710115.1710123.</p>
<p><a name="7">7</a>: J. L. Hellerstein, V. Morrison, E. Eilebrecht, Optimizing Concurrency Levels in the .NET Threadpool. – FeBID Workshop 2008, Annapolis, MD USA.</p>
<p><a name="8">8</a>: L. Hernando, A. Mendiburu, J. P. Lozano, Hill-Climbing Algorithm: Let’s Go for a Walk Before Finding the Optimum. – 2018 IEEE Congress on Evolutionary Computation (CEC), 2018, pp.1-7.</p>
<p><a name="9">9</a>: Hill climbing. – <a href="https://en.wikipedia.org/wiki/Hill_climbing">https://en.wikipedia.org/wiki/Hill_climbing</a></p>
<p><a name="10">10</a>: D. Iclanzan, D. Dumitrescu, Overcoming Hierarchical Difficulty by Hill-Climbing the Building Block Structure (February 2007). – <a href="https://arxiv.org/abs/cs/0702096">https://arxiv.org/abs/cs/0702096</a></p>
<p><a name="11">11</a>: A. Ilinchik, How to set an ideal thread pool size (April 2019). – <a href="https://engineering.zalando.com/posts/2019/04/how-to-set-an-ideal-thread-pool-size.html">https://engineering.zalando.com/posts/2019/04/how-to-set-an-ideal-thread-pool-size.html</a></p>
<p><a name="12">12</a>: A. W. Johnson, Generalized Hill Climbing Algorithms for Discrete Optimization Problems. - PhD thesis, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, October, 1996, 132 pp. - <font size="4"><a href="https://researchgate.net/publication/277791527_Generalized_Hill_Climbing_Algorithms_For_Discrete_Optimization_Problems">https://researchgate.net/publication/277791527_Generalized_Hill_Climbing_Algorithms_For_Discrete_Optimization_Problems</a></font></p>
<p><a name="13">13</a>: A. Mughees, How to benchmark performance of MySQL using Sysbench (June 2020). – <a href="https://ittutorial.org/how-to-benchmark-performance-of-mysql-using-sysbench">https://ittutorial.org/how-to-benchmark-performance-of-mysql-using-sysbench</a></p>
<p><a name="14">14</a>: K. Nagarajan, A Predictive Hill Climbing Algorithm for Real Valued multi-Variable Optimization Problem Like PID Tuning. – International Journal of Machine Learning and Computing, vol.8, No.1, February, 2018. – <a href="https://ijmlc.org/vol8/656-A11.pdf">https://ijmlc.org/vol8/656-A11.pdf</a></p>
<p><a name="15">15</a>: Oracle GlassFish Server 3.1 Performance Tuning Guide. – <a href="https://docs.oracle.com/cd/E18930_01/pdf/821-2431.pdf">https://docs.oracle.com/cd/E18930_01/pdf/821-2431.pdf</a></p>
<p><a name="16">16</a>: K. Pepperdine, Tuning the Size of Your Thread Pool (May 2013). – <a href="https://infoq.com/articles/Java-Thread-Pool-Performance-Tuning">https://infoq.com/articles/Java-Thread-Pool-Performance-Tuning</a></p>
<p><a name="17">17</a>: A. Rosete-Suarez, A. Ochoa-Rodriquez, M. Sebag, Automatic Graph Drawing and Stochastic Hill Climbing. – GECCO’99: Proceedings of the First Annual Conference on Genetic and Evolutionary Computing, vol. 2, July 1999, pp.1699-1706.</p>
<p><a name="18">18</a>: S. Ruder, An overview of gradient descent optimization algorithms (2016). – <a href="https://arxiv.org/abs/1609.04747">https://arxiv.org/abs/1609.04747</a></p>
<p><a name="19">19</a>: M. Schüle, F. Simonis, T. Heyenbrock, A. Kemper, S. Günnemann, T. Neumann, In-Database Machine Learning: Gradient Descent and Tensor Algebra for Main Memory Database Systems. – In: Grust, T., Naumann, F., Böhm, A., Lehner, W., Härder, T., Rahm, E., Heuer, A., Klettke, M. & Meyer, H. (Hrsg.), BTW 2019. Gesellschaft für Informatik, Bonn. pp. 247-266. – <a href="https://dl.gi.de/bitstream/handle/20.500.12116/21700/B6-1.pdf?sequence=1&isAllowed=y">https://dl.gi.de/bitstream/handle/20.500.12116/21700/B6-1.pdf?sequence=1&isAllowed=y</a> doi: 10.184.20/btw2019-16.</p>
<p><a name="20">20</a>: B. Selman, C. P. Gomes, Hill-climbing Search (2001). – <a href="https://www.cs.cornell.edu/selman/papers/pdf/02.encycl-hillclimbing.pdf">https://www.cs.cornell.edu/selman/papers/pdf/02.encycl-hillclimbing.pdf</a></p>
<p><a name="21">21</a>: J. Timm, An OS-level adaptive thread pool scheme for I/O-heavy workloads. – Master thesis, Delft University of Technology, 2021. <a href="https://repository.tudelft.nl/islandora/object/uuid%3A5c9b4c42-8fdc-4170-b978-f80cd8f00753">https://repository.tudelft.nl/islandora/object/uuid%3A5c9b4c42-8fdc-4170-b978-f80cd8f00753</a></p>
<p><a name="22">22</a>: M. Warren, The CLR Thread Pool ‘Thread Injection’ Algorithm (April 2017). – <a href="https://codeproject.com/Articles/1182012/The-CLR-Thread-Pool-Thread-Injection-Algorithm">https://codeproject.com/Articles/1182012/The-CLR-Thread-Pool-Thread-Injection-Algorithm</a></p>
<p><a name="23">23</a>: J.-H. Wu, R. Kalyanam, P. Givan, Stochastic Enforced Hill-Climbing. – Proceedings of the 2013 IEEE International Conference on Systems, Man and Cybernetics, October 2013.</p>
<p><a name="24">24</a>: X. Zhou, J. Sun, Database Meets Artificial Intelligence. – IEEE Transactions on Knowledge and Data Engineering, May 2020. doi: 10.1109/TKDE.2020.2994641.</p>
<p><a name="25">25</a>: <a href="https://github.com/dotnet/coreclr/blob/master/src/vm/win32threadpool.cpp">https://github.com/dotnet/coreclr/blob/master/src/vm/win32threadpool.cpp</a></p>
<p>Ilya Trub. In our previous blog post we discussed the purpose of a thread pool, various approaches to implementing a thread pool, along with a simulation model describing thread pool implementations in MariaDB and Percona Server. In this post we will look into another methodology of tuning the thread pool size, namely the adaptive Hill Climbing algorithm.</p>
<h1 id="spinning-in-the-cloud">Spinning in the Cloud: How to Fix MySQL 8.0 Log Commit for Containers</h1>
<p>2021-12-27, <a href="https://mysqlperf.github.io/mysql/spinning-in-the-cloud-redo-log-commit">https://mysqlperf.github.io/mysql/spinning-in-the-cloud-redo-log-commit</a></p>
<p>If you are running MySQL as a Kubernetes pod or a Docker container, there is a chance you are using a CPU quota to limit its resource usage, which is also typical for cloud environments. But do you know what kind of issues you may see when running MySQL in environments like that?</p>
<h1 id="experiment">Experiment</h1>
<p>I’ll start MySQL as follows:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">BASEDIR</span><span class="o">=</span>/home/sergei/git/msql/bld/install/usr/local/mysql
systemd-run <span class="nt">--scope</span> <span class="nt">-p</span> <span class="nv">CPUQuota</span><span class="o">=</span>100% <span class="nt">-p</span> <span class="nv">AllowedCPUs</span><span class="o">=</span>0,1,2,3,4,5 <span class="se">\</span>
<span class="k">${</span><span class="nv">BASEDIR</span><span class="k">}</span>/bin/mysqld <span class="nt">--basedir</span><span class="o">=</span><span class="k">${</span><span class="nv">BASEDIR</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">--datadir</span><span class="o">=</span>/dev/shm/data <span class="se">\</span>
<span class="nt">--innodb-buffer-pool-size</span><span class="o">=</span>8G <span class="nt">-uroot</span>
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">CPUQuota=100%</code> and <code class="language-plaintext highlighter-rouge">AllowedCPUs=0,1,2,3,4,5</code> mean that MySQL is allowed to
run on 6 CPUs, but its CPU utilization will be capped at 100% (or 1 vCPU) by the CFS
bandwidth control mechanism.</p>
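<p>The effective parallelism under CFS bandwidth control is just the quota-to-period ratio, capped by the size of the allowed CPU set. A small sketch of the arithmetic (using the cgroup v1 parameter names <code class="language-plaintext highlighter-rouge">cpu.cfs_quota_us</code> and <code class="language-plaintext highlighter-rouge">cpu.cfs_period_us</code>):</p>

```python
def effective_vcpus(quota_us, period_us, n_allowed_cpus):
    """CFS bandwidth control: the group may run on any of its allowed CPUs
    in parallel, but its total runtime per period is capped at quota_us.
    In cgroup v1, a quota of -1 means 'no limit'."""
    if quota_us < 0:
        return float(n_allowed_cpus)
    return min(quota_us / period_us, n_allowed_cpus)

# CPUQuota=100% -> quota equals the period: 1 vCPU despite 6 allowed CPUs.
assert effective_vcpus(100_000, 100_000, 6) == 1.0
# CPUQuota=600% would remove the effective cap for this 6-CPU cpuset.
assert effective_vcpus(600_000, 100_000, 6) == 6.0
```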
<p>Then I’ll start <code class="language-plaintext highlighter-rouge">sysbench</code> <code class="language-plaintext highlighter-rouge">OLTP_UPDATE_INDEX</code> as follows:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./src/sysbench ./src/lua/oltp_update_index.lua <span class="se">\</span>
<span class="nt">--mysql-socket</span><span class="o">=</span>/tmp/mysql.sock <span class="nt">--mysql-user</span><span class="o">=</span>root <span class="se">\</span>
<span class="nt">--tables</span><span class="o">=</span>10 <span class="nt">--table-size</span><span class="o">=</span>1000000 <span class="nt">--threads</span><span class="o">=</span>64 <span class="se">\</span>
<span class="nt">--report-interval</span><span class="o">=</span>1 <span class="nt">--db-ps-mode</span><span class="o">=</span>disable <span class="nt">--time</span><span class="o">=</span>60 run
</code></pre></div></div>
<p>The TPS I get is around 3,500:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SQL statistics:
queries performed:
read: 0
write: 207800
other: 0
total: 207800
transactions: 207800 (3462.61 per sec.)
queries: 207800 (3462.61 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
Throughput:
events/s (eps): 3462.6082
time elapsed: 60.0126s
total number of events: 207800
Latency (ms):
min: 0.12
avg: 18.48
max: 286.17
95th percentile: 82.96
sum: 3840431.03
Threads fairness:
events (avg/stddev): 3246.8750/45.59
execution time (avg/stddev): 60.0067/0.00
</code></pre></div></div>
<p>I am not quite happy with the performance numbers I get, so it’s time for some
profiling. Below is the most interesting part of the perf output:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">-</span> <span class="mf">21.20</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span> <span class="p">[.]</span> <span class="n">log_write_up_to</span>
<span class="n">log_write_up_to</span>
<span class="n">innobase_flush_logs</span>
<span class="n">plugin_foreach_with_mask</span>
<span class="n">plugin_foreach_with_mask</span>
<span class="n">ha_flush_logs</span>
<span class="n">MYSQL_BIN_LOG</span><span class="o">::</span><span class="n">fetch_and_process_flush_stage_queue</span>
<span class="n">MYSQL_BIN_LOG</span><span class="o">::</span><span class="n">process_flush_stage_queue</span>
<span class="n">MYSQL_BIN_LOG</span><span class="o">::</span><span class="n">ordered_commit</span>
<span class="n">MYSQL_BIN_LOG</span><span class="o">::</span><span class="n">commit</span>
<span class="n">ha_commit_trans</span>
<span class="n">trans_commit_stmt</span>
<span class="n">mysql_execute_command</span>
<span class="n">dispatch_sql_command</span>
<span class="n">dispatch_command</span>
<span class="n">do_command</span>
<span class="n">handle_connection</span>
<span class="n">pfs_spawn_thread</span>
<span class="n">start_thread</span>
</code></pre></div></div>
<h1 id="problem">Problem</h1>
<p>We see that 21% of CPU time is spent in the <code class="language-plaintext highlighter-rouge">log_write_up_to</code> function which is called on
commit. We can actually annotate this function to see what exactly this time is
spent on:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Percent</span><span class="err">│</span>
<span class="err">│</span> <span class="k">if</span> <span class="p">(</span><span class="n">condition</span><span class="p">(</span><span class="n">wait</span><span class="p">))</span> <span class="p">{</span>
<span class="err">│</span> <span class="k">return</span> <span class="p">(</span><span class="n">Wait_stats</span><span class="p">{</span><span class="n">waits</span><span class="p">});</span>
<span class="err">│</span> <span class="p">}</span>
<span class="err">│</span>
<span class="err">│</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">wait</span><span class="p">)</span> <span class="p">{</span>
<span class="mf">0.46</span> <span class="err">│</span><span class="mi">352</span><span class="o">:</span> <span class="n">test</span> <span class="o">%</span><span class="n">r13</span><span class="p">,</span><span class="o">%</span><span class="n">r13</span>
<span class="err">│</span> <span class="err">↓</span> <span class="n">je</span> <span class="mi">518</span>
<span class="err">│</span> <span class="cm">/* It's still spin-delay loop. */</span>
<span class="err">│</span> <span class="o">--</span><span class="n">spins_limit</span><span class="p">;</span>
<span class="err">│</span> <span class="n">sub</span> <span class="err">$</span><span class="mh">0x1</span><span class="p">,</span><span class="o">%</span><span class="n">r13</span>
<span class="err">│</span>
<span class="err">│</span> <span class="n">UT_RELAX_CPU</span><span class="p">();</span>
<span class="mf">96.30</span> <span class="err">│</span> <span class="n">pause</span>
<span class="err">│</span> <span class="k">const</span> <span class="kt">int64_t</span> <span class="n">sig_count</span> <span class="o">=</span> <span class="o">!</span><span class="n">wait</span> <span class="o">?</span> <span class="mi">0</span> <span class="o">:</span> <span class="n">os_event_reset</span><span class="p">(</span><span class="n">event</span><span class="p">);</span>
<span class="mf">0.36</span> <span class="err">│</span><span class="mi">361</span><span class="o">:</span> <span class="n">movq</span> <span class="err">$</span><span class="mh">0x0</span><span class="p">,</span><span class="o">-</span><span class="mh">0x98</span><span class="p">(</span><span class="o">%</span><span class="n">rbp</span><span class="p">)</span>
<span class="err">│</span> <span class="n">test</span> <span class="o">%</span><span class="n">r13</span><span class="p">,</span><span class="o">%</span><span class="n">r13</span>
<span class="err">│</span> <span class="err">↓</span> <span class="n">je</span> <span class="mi">582</span>
<span class="err">│</span> <span class="n">std</span><span class="o">::</span><span class="n">__uniq_ptr_impl</span><span class="o"><</span><span class="n">Log_test</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">default_delete</span><span class="o"><</span><span class="n">Log_test</span><span class="o">></span> <span class="o">>::</span><span class="n">_M_ptr</span><span class="p">()</span> <span class="k">const</span><span class="o">:</span>
<span class="err">│</span><span class="mi">375</span><span class="o">:</span> <span class="n">lea</span> <span class="n">log_test</span><span class="p">,</span><span class="o">%</span><span class="n">rax</span>
<span class="mf">0.34</span> <span class="err">│</span> <span class="n">mov</span> <span class="p">(</span><span class="o">%</span><span class="n">rax</span><span class="p">),</span><span class="o">%</span><span class="n">r8</span>
<span class="err">│</span> <span class="k">operator</span><span class="p">()()</span><span class="o">:</span>
<span class="err">│</span> <span class="n">LOG_SYNC_POINT</span><span class="p">(</span><span class="s">"log_wait_for_flush_before_flushed_to_disk_lsn"</span><span class="p">);</span>
<span class="err">│</span> <span class="n">test</span> <span class="o">%</span><span class="n">r8</span><span class="p">,</span><span class="o">%</span><span class="n">r8</span>
<span class="err">│</span> <span class="n">mov</span> <span class="o">%</span><span class="n">r8</span><span class="p">,</span><span class="o">-</span><span class="mh">0x90</span><span class="p">(</span><span class="o">%</span><span class="n">rbp</span><span class="p">)</span>
<span class="err">│</span> <span class="err">↓</span> <span class="n">je</span> <span class="mi">423</span>
</code></pre></div></div>
<p>The answer is simple - MySQL spends 21% of its CPU time spinning.</p>
<p>Let me give you some background. MySQL 8.0 comes with a redesigned redo logging
subsystem. There is now a dedicated redo log writer thread which writes data from
the redo log buffer to disk and a dedicated redo log flusher thread which calls
<code class="language-plaintext highlighter-rouge">fsync()</code> on the log files.</p>
<p>A client thread committing a transaction now simply writes to the redo log
buffer, updates the lock-free <code class="language-plaintext highlighter-rouge">log.recent_written</code> <code class="language-plaintext highlighter-rouge">Link_buf</code> structure with the
<code class="language-plaintext highlighter-rouge">LSN</code> it has written up to, and then waits for <code class="language-plaintext highlighter-rouge">log.flushed_to_disk_lsn</code> (or
<code class="language-plaintext highlighter-rouge">log.write_lsn</code>, depending on the <code class="language-plaintext highlighter-rouge">innodb_flush_log_at_trx_commit</code> setting) to
advance past the written <code class="language-plaintext highlighter-rouge">LSN</code>.</p>
<p>How is that waiting implemented? There are two arrays of condition variables,
2048 elements each (there’s actually a setting, hidden behind the
<code class="language-plaintext highlighter-rouge">ENABLE_EXPERIMENT_SYSVARS</code> compiler define, that one could enable, rebuild and
play with) - <code class="language-plaintext highlighter-rouge">log.write_events</code> and <code class="language-plaintext highlighter-rouge">log.flush_events</code>. There are
also two notifier threads, <code class="language-plaintext highlighter-rouge">log_write_notifier</code> and <code class="language-plaintext highlighter-rouge">log_flush_notifier</code>, which
signal the corresponding condition variables when a redo log block gets written or
flushed.</p>
<p>This scheme works fine, but it has some issues. Let’s say we have
a single client thread which has committed a short transaction. It now has to wait
on a condition variable to be signaled by <code class="language-plaintext highlighter-rouge">log_flush_notifier</code>, which is costly
in terms of latency. It is much better to spin-wait on <code class="language-plaintext highlighter-rouge">log.flushed_to_disk_lsn</code> for a while
and, in case the redo log gets flushed soon, return to the client without waiting on
the condition variable. This saves us some latency on syscalls and context switches.</p>
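The spin-then-wait pattern described above can be sketched as follows. This is a simplified illustration of the idea, not the actual InnoDB code; the variable and function names (`flushed_to_disk_lsn`, `wait_for_flush`) only mirror the concepts mentioned in the text.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Illustrative sketch of "spin first, then block": a committing thread
// polls the flushed LSN for a bounded number of rounds before falling
// back to a condition variable. Not the actual InnoDB implementation.
std::atomic<uint64_t> flushed_to_disk_lsn{0};
std::mutex flush_mutex;
std::condition_variable flush_cv;

bool wait_for_flush(uint64_t my_lsn, int max_spin_rounds) {
  // Spin phase: cheap re-checks, no syscalls, no context switch.
  for (int i = 0; i < max_spin_rounds; ++i) {
    if (flushed_to_disk_lsn.load(std::memory_order_acquire) >= my_lsn)
      return true;  // the redo log was flushed while we were spinning
  }
  // Fallback: block until a notifier thread signals progress.
  std::unique_lock<std::mutex> lk(flush_mutex);
  flush_cv.wait(lk, [my_lsn] {
    return flushed_to_disk_lsn.load(std::memory_order_acquire) >= my_lsn;
  });
  return false;  // we had to block on the condition variable
}
```

In the fast path the function returns from the spin loop and the syscall/context-switch cost of the condition variable is avoided entirely.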
<p>The question is - for how long should we spin, and when should we fall back to waiting?
The MySQL server team’s answer is adaptive spinning. The client
thread will spin if there are spare CPU cycles, and wait if the CPU is hogged. There are two variables controlling spinning:</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">innodb_log_spin_cpu_abs_lwm</code> which defines the minimum amount of CPU usage
below which threads no longer spin (default is 80%, here we look at the CPU
utilization as reported by <code class="language-plaintext highlighter-rouge">top</code>)</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">innodb_log_spin_cpu_pct_hwm</code> which defines the maximum amount of CPU usage
above which user threads no longer spin (default is 50%, here we take the CPU
utilization as reported by <code class="language-plaintext highlighter-rouge">top</code> and divide it by the number of available
CPUs)</p>
</li>
</ul>
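Put together, the decision looks roughly like this. This is a hedged sketch of the predicate, not the exact server condition; the field names correspond to the `srv_cpu_usage` members inspected with gdb below.

```cpp
// Sketch of the adaptive-spinning decision driven by the two system
// variables. An approximation for illustration, not the exact mysqld code.
struct CpuUsage {
  double utime_abs;  // user CPU %, absolute (as reported by top)
  double stime_abs;  // system CPU %, absolute
  double utime_pct;  // user CPU %, divided by the number of visible CPUs
  double stime_pct;  // system CPU %, divided by the number of visible CPUs
};

bool may_spin(const CpuUsage &u, double abs_lwm, double pct_hwm) {
  // Spin only if mysqld itself is busy enough (above the low water mark)
  // and the machine as a whole still appears to have headroom
  // (below the high water mark).
  return (u.utime_abs + u.stime_abs) >= abs_lwm &&
         (u.utime_pct + u.stime_pct) <= pct_hwm;
}
```

With the gdb numbers shown below (98.5% absolute, 16.4% relative) this predicate is true under the defaults, which is exactly why the server keeps spinning.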
<p>OK, let’s have a look at the <code class="language-plaintext highlighter-rouge">top</code> output. CPU utilization by <code class="language-plaintext highlighter-rouge">mysqld</code> is
reported between <strong>96%</strong> and <strong>102%</strong> - we have maxed out our CPU quota, so there should be no
spinning!</p>
<p>It’s time for <code class="language-plaintext highlighter-rouge">gdb</code>. CPU usage statistics are accumulated in the global variable
called <code class="language-plaintext highlighter-rouge">srv_cpu_usage</code>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">gdb</span><span class="p">)</span> <span class="n">p</span> <span class="n">srv_cpu_usage</span>
<span class="err">$</span><span class="mi">1</span> <span class="o">=</span> <span class="p">{</span><span class="n">n_cpu</span> <span class="o">=</span> <span class="mi">6</span><span class="p">,</span> <span class="n">utime_abs</span> <span class="o">=</span> <span class="mf">83.261736069525156</span><span class="p">,</span> <span class="n">stime_abs</span> <span class="o">=</span> <span class="mf">15.321047995993085</span><span class="p">,</span> <span class="n">utime_pct</span> <span class="o">=</span> <span class="mf">13.876956011587525</span><span class="p">,</span>
<span class="n">stime_pct</span> <span class="o">=</span> <span class="mf">2.5535079993321808</span><span class="p">}</span>
<span class="p">(</span><span class="n">gdb</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s interpret these numbers:</p>
<ul>
<li>
<p>mysqld sees 6 CPUs which is how many we have specified with
<code class="language-plaintext highlighter-rouge">AllowedCPUs=0,1,2,3,4,5</code></p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">utime_abs + stime_abs</code> (sum of the user and system CPU time) is <strong>98.5%</strong> which
is in line with what <code class="language-plaintext highlighter-rouge">top</code> reports</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">utime_pct + stime_pct</code> is <strong>16.4%</strong> which is simply <code class="language-plaintext highlighter-rouge">98.5/6</code></p>
</li>
</ul>
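The relationship between the `abs` and `pct` numbers can be checked with a one-line helper (illustrative only, reflecting the division by `n_cpu` described above):

```cpp
// Illustrative: the *_pct values are just the absolute CPU usage divided
// by the number of CPUs mysqld believes it can use.
double pct_from_abs(double abs_usage, int n_cpu) {
  return abs_usage / n_cpu;
}
```

For the values above, `pct_from_abs(98.5, 6)` gives roughly 16.4, matching `utime_pct + stime_pct` from the gdb output.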
<p>But the <code class="language-plaintext highlighter-rouge">pct</code> values are off. MySQL considers all 6 cores to be at its
disposal, and since they appear to be underutilized (16% is way below the 50% high water mark) it spins
to improve latency. MySQL simply doesn’t know anything about the CFS quota I specified for it.</p>
<h1 id="workaround">Workaround</h1>
<p>Let’s verify our assumption. Here is the corresponding code in
<code class="language-plaintext highlighter-rouge">srv_update_cpu_usage()</code>:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">cpu_set_t</span> <span class="n">cs</span><span class="p">;</span>
<span class="n">CPU_ZERO</span><span class="p">(</span><span class="o">&</span><span class="n">cs</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sched_getaffinity</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">cs</span><span class="p">),</span> <span class="o">&</span><span class="n">cs</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">n_cpu</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">constexpr</span> <span class="kt">int</span> <span class="n">MAX_CPU_N</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">MAX_CPU_N</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">CPU_ISSET</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="o">&</span><span class="n">cs</span><span class="p">))</span> <span class="p">{</span>
<span class="o">++</span><span class="n">n_cpu</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It simply obtains the affinity of the <code class="language-plaintext highlighter-rouge">mysqld</code> process and counts the number of CPUs in the
<code class="language-plaintext highlighter-rouge">cpuset</code>. Let’s hard-code <code class="language-plaintext highlighter-rouge">n_cpu = 1</code> and repeat our <code class="language-plaintext highlighter-rouge">sysbench</code> test:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SQL statistics:
queries performed:
read: 0
write: 280852
other: 0
total: 280852
transactions: 280852 (4677.44 per sec.)
queries: 280852 (4677.44 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
Throughput:
events/s (eps): 4677.4425
time elapsed: 60.0439s
total number of events: 280852
Latency (ms):
min: 0.11
avg: 13.68
max: 193.05
95th percentile: 86.00
sum: 3842379.62
Threads fairness:
events (avg/stddev): 4388.3125/63.74
execution time (avg/stddev): 60.0372/0.00
</code></pre></div></div>
<p>Looks much better, and the spinning is gone:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">+</span> <span class="mf">2.85</span><span class="o">%</span> <span class="n">connection</span> <span class="n">libpthread</span><span class="o">-</span><span class="mf">2.33</span><span class="p">.</span><span class="n">so</span> <span class="p">[.]</span> <span class="n">__pthread_mutex_cond_lock</span>
<span class="o">+</span> <span class="mf">2.47</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">ut_delay</span>
<span class="o">+</span> <span class="mf">1.40</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">MYSQLparse</span>
<span class="o">+</span> <span class="mf">1.13</span><span class="o">%</span> <span class="n">connection</span> <span class="p">[</span><span class="n">kernel</span><span class="p">.</span><span class="n">kallsyms</span><span class="p">]</span> <span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="n">syscall_exit_to_user_mode</span>
<span class="o">+</span> <span class="mf">0.95</span><span class="o">%</span> <span class="n">connection</span> <span class="n">libc</span><span class="o">-</span><span class="mf">2.33</span><span class="p">.</span><span class="n">so</span> <span class="p">[.]</span> <span class="n">__memmove_avx_unaligned_erms</span>
<span class="o">+</span> <span class="mf">0.89</span><span class="o">%</span> <span class="n">connection</span> <span class="n">libc</span><span class="o">-</span><span class="mf">2.33</span><span class="p">.</span><span class="n">so</span> <span class="p">[.]</span> <span class="n">malloc</span>
<span class="o">+</span> <span class="mf">0.87</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">rec_get_offsets_func</span>
<span class="o">+</span> <span class="mf">0.75</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">rec_init_offsets</span>
<span class="o">+</span> <span class="mf">0.73</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">mutex_enter_inline</span><span class="o"><</span><span class="n">PolicyMutex</span><span class="o"><</span><span class="n">TTASEventMutex</span><span class="o"><</span><span class="n">GenericPolicy</span><span class="o">></span> <span class="o">></span> <span class="o">></span>
<span class="o">+</span> <span class="mf">0.69</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">page_cur_insert_rec_write_log</span>
<span class="o">+</span> <span class="mf">0.65</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">ha_insert_for_fold_func</span>
<span class="o">+</span> <span class="mf">0.62</span><span class="o">%</span> <span class="n">connection</span> <span class="p">[</span><span class="n">kernel</span><span class="p">.</span><span class="n">kallsyms</span><span class="p">]</span> <span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="n">psi_group_change</span>
<span class="o">+</span> <span class="mf">0.55</span><span class="o">%</span> <span class="n">connection</span> <span class="n">libpthread</span><span class="o">-</span><span class="mf">2.33</span><span class="p">.</span><span class="n">so</span> <span class="p">[.]</span> <span class="n">__pthread_mutex_lock</span>
<span class="o">+</span> <span class="mf">0.50</span><span class="o">%</span> <span class="n">connection</span> <span class="n">mysqld</span><span class="o">-</span><span class="mi">1</span><span class="n">cpu</span> <span class="p">[.]</span> <span class="n">buf_page_hash_get_low</span>
</code></pre></div></div>
<p>We can get a similar effect by setting <code class="language-plaintext highlighter-rouge">innodb_log_spin_cpu_pct_hwm=8</code> (which is 50 / 6).</p>
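The workaround value generalizes: scale the default high water mark by the ratio of quota CPUs to visible CPUs. A back-of-the-envelope helper, not a server feature:

```cpp
// innodb_log_spin_cpu_pct_hwm is interpreted against all visible CPUs,
// so shrink it proportionally to the quota actually granted.
int adjusted_pct_hwm(int default_hwm, double quota_cpus, int visible_cpus) {
  int v = static_cast<int>(default_hwm * quota_cpus / visible_cpus);
  return v > 0 ? v : 1;  // keep at least 1% so spinning can still occur
}
```

For a 1-CPU quota on a 6-CPU affinity mask this yields `50 * 1 / 6 = 8`, the value used above.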
<h1 id="conclusion">Conclusion</h1>
<p>Adaptiveness is the future of databases, and we will see a lot more of it.
Likewise, more and more MySQL instances will be running in
various cloud environments.</p>
<p class="notice--info"><strong>Adaptive MySQL code should consider taking into account cloud
environments, including the ones that use CFS bandwidth control
mechanisms.</strong></p>
<p>Even though the current implementation of adaptive spinning in the redo
log writer in MySQL 8.0 is not container/quota/cloud aware, a simple
workaround can be used by tuning the <code class="language-plaintext highlighter-rouge">innodb_log_spin_cpu_pct_hwm</code>
system variable.</p>Sergey GlushchenkoIf you are running MySQL as a Kubernetes pod or a Docker container, there is a chance you are using CPU quota to limit its resource usage, which is also typical for cloud environments. But do you know what kind of issues you may see when running MySQL in environments like that?Simulation of thread pool for database server2021-12-06T18:00:00+03:002021-12-06T18:00:00+03:00https://mysqlperf.github.io/mysql/simulation-of-threadpool-for-database-server<p>This is a blog post version of our paper “Simulation of thread pool for database server” that will be published in <a href="http://ceur-ws.org/">CEUR Workshop Proceedings</a>. In the article, we consider an object-oriented simulation model of a thread pool. The thread pool implementation in MariaDB and Percona Server was taken as a basis. The model’s input flows and their distributions are described. The model’s output results are consistent with the known “concurrency level – throughput” dependency patterns for IO- and CPU-bound workloads. The model is written in C++ and its software architecture is also considered, including the provided classes, methods and call graph. The model takes into account the “thread contention” phenomenon, and mathematical expressions for it are proposed. The model has practical value as an effective tool for static and dynamic analysis of the most significant parameters affecting performance and for the optimal choice of these parameters.</p>
<h2 id="introduction">Introduction</h2>
<p>Simulation is known to be an effective decision-making tool in a wide range of applied problems in industry, transport, medicine and military science. However, the use of simulation as “computer science for itself” is equally important. There are many examples of using simulation in software development and in the design of complex IT systems. One such system is the thread pool, which has been implemented in various software systems over the past 25 years. The concept of a thread pool is an alternative to the rule “one connection – one thread”. It allows not only saving resources, but also improving the performance of the software product as a whole. The basic idea is to re-use an already existing thread for handling a new task. There are several thread pool implementations, and <sup><a href="#10">[10]</a></sup> contains the most extensive review of them. There is also a lot of documentation about specific ones. For example, one of the first thread pools was described in <sup><a href="#14">[14]</a></sup> for an object request broker. Thread pools for Android and the Oracle GlassFish application server are described in <sup><a href="#1">[1]</a></sup> and <sup><a href="#11">[11]</a></sup> respectively. <sup><a href="#26">[26]</a></sup> contains an effective example of using a Python thread pool for a difficult scientific problem. The Microsoft CLR thread pool is described in the fundamental work [18], and the most recent open-source implementation is available in <sup><a href="#24">[24]</a></sup>. A Java-specific threading extension is proposed in the scientific research <sup><a href="#25">[25]</a></sup>. DBMS developers also pay much attention to this feature, in particular MySQL <sup><a href="#15">[15]</a></sup>, MariaDB <sup><a href="#20">[20]</a></sup> and Percona Server <sup><a href="#13">[13]</a></sup>. The distinctive features of these thread pools are the following:</p>
<ul>
<li>Connections are put into a thread group at connect time on a round-robin basis. The number of thread groups is configurable.</li>
<li>Each thread group tries to keep the number of active threads (those executing on a CPU) at one or zero. If a query is already executing in the thread group, the connection is put into the wait queue.</li>
<li>Put waiting connections into the high priority queue when a transaction is already started on the connection.</li>
<li>Allow another query to execute if the queue is not empty and no queries have completed during the specified time interval. This is handled by a special thread called the <em>Timer</em>.</li>
</ul>
<p>The paper focuses only on the model of this thread pool variety.
All thread pools have many parameters, which are assigned by a developer or DBA and affect the final thread pool performance. The main parameter is the thread pool size (<em>tp-size</em>), also known as the <em>concurrency level</em>. The number of thread groups plays this role in the above-mentioned DBMS implementations. The choice of tp-size depends on many factors, such as the number of CPUs, the memory volume and the number of concurrent client requests, but the most subtle one is the so-called <em>workload profile</em>. It should be noted that the optimal values of tp-size differ significantly for CPU-bound and IO-bound workloads even if the number of connections to the server is the same <sup><a href="#22">[22]</a></sup>. Many articles contain recommendations on how to choose tp-size in a simple way. Examples are <sup><a href="#1">[1]</a></sup>, <sup><a href="#7">[7]</a></sup>, <sup><a href="#8">[8]</a></sup>, <sup><a href="#11">[11]</a></sup>, <sup><a href="#12">[12]</a></sup>, where the suggestions are based on the number of CPU cores, the average request latency on CPU, the average off-CPU time and Little’s law, well known in queueing theory. However, these approaches do not allow maximizing throughput, and Microsoft specialists have achieved the greatest success in this direction. Articles <sup><a href="#5">[5]</a></sup>, <sup><a href="#6">[6]</a></sup> consider the use of the <em>HillClimbing</em> optimizer for the choice of tp-size, and <sup><a href="#21">[21]</a></sup> contains test results. Work <sup><a href="#4">[4]</a></sup> describes a variant of <em>HillClimbing</em> based not on gradient descent but on a signal-processing approach, because it is more robust to random fluctuations. Other algorithms for tp-size calculation are proposed in <sup><a href="#10">[10]</a></sup> and <sup><a href="#19">[19]</a></sup>. The most recent work <sup><a href="#19">[19]</a></sup> contains a complete and up-to-date reference list for this problem.
At the same time, a simulation model can help to gain a deep understanding of how a thread pool works. Thread pool performance (expressed, for example, in transactions per second) depends not only on tp-size, but on other parameters as well. To determine how each parameter affects performance, many long and expensive experiments on working servers would be needed. However, a smart simulation model can do that much faster, which is the main advantage of simulation for any task. It should be said that the simulation approach has been explored rather weakly so far. Work <sup><a href="#2">[2]</a></sup> uses a rather specific and rare tool, and the recent work <sup><a href="#17">[17]</a></sup> by Ukrainian specialists applies stochastic Petri nets to simulate a thread pool. Our paper proposes a thread pool simulation model where the implementation from <sup><a href="#13">[13]</a></sup> is taken as a basis. The model is written in C++ following the methodology described in <sup><a href="#27">[27]</a></sup>, which allows all algorithmic features of the system to be covered flexibly and in full. The thread pool itself is described in section 2, section 3 contains the software architecture of the proposed model, some results of the model’s validation are presented in section 4, and section 5 contains conclusions and suggestions on how to use the model.</p>
<h2 id="description-of-simulated-thread-pool">Description of simulated thread pool</h2>
<p>Neglecting secondary details, the thread pool call graph looks like this (Fig. 1), where the designations correspond to the following functions (Table 1).</p>
<p style="text-align: center; font-size:0.7em;"><a href="/assets/images/54d97df0863605abb74f9ca9c4d176bf.jpeg" title="Figure 1: Thread pool call graph"><img src="/assets/images/54d97df0863605abb74f9ca9c4d176bf.jpeg" alt="54d97df0863605abb74f9ca9c4d176bf" /></a>
Figure 1: Thread pool call graph</p>
<p style="text-align: center; font-size:0.7em;">Table 1. Thread pool functions</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Description</th>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 <code class="language-plaintext highlighter-rouge">add_connection</code></td>
<td>Add a new connection, choose thread group for it</td>
<td>11 <code class="language-plaintext highlighter-rouge">timeout_check</code></td>
<td>Check if the connection has expired (the request took too long); if so, delete the connection</td>
</tr>
<tr>
<td>2 <code class="language-plaintext highlighter-rouge">wait_begin</code></td>
<td>Callback for the start of off-CPU round</td>
<td>12 <code class="language-plaintext highlighter-rouge">create_worker</code></td>
<td>Create a new thread</td>
</tr>
<tr>
<td>3 <code class="language-plaintext highlighter-rouge">start_timer</code></td>
<td>Start a timer thread to track stalled threads</td>
<td>13 <code class="language-plaintext highlighter-rouge">wake_thread</code></td>
<td>Wake an idle thread</td>
</tr>
<tr>
<td>4 <code class="language-plaintext highlighter-rouge">set_tp_size</code></td>
<td>Set the thread pool size</td>
<td>14 <code class="language-plaintext highlighter-rouge">too_many_threads</code></td>
<td>Check if there are too many active threads in group</td>
</tr>
<tr>
<td>5 <code class="language-plaintext highlighter-rouge">wait_end</code></td>
<td>Callback for the end of off-CPU round</td>
<td>15 <code class="language-plaintext highlighter-rouge">worker_main</code></td>
<td>Main function for thread from thread pool</td>
</tr>
<tr>
<td>6 <code class="language-plaintext highlighter-rouge">queue_put</code></td>
<td>Put a new connection into queue</td>
<td>16 <code class="language-plaintext highlighter-rouge">handle_event</code></td>
<td>Preparing to serve a request</td>
</tr>
<tr>
<td>7 <code class="language-plaintext highlighter-rouge">timer_thread</code></td>
<td>Main function for the timer thread</td>
<td>17 <code class="language-plaintext highlighter-rouge">get_event</code></td>
<td>Assign a connection to ready thread (make it active)</td>
</tr>
<tr>
<td>8 <code class="language-plaintext highlighter-rouge">wakeCreateThread</code></td>
<td>Create a new thread or wake idle</td>
<td>18 <code class="language-plaintext highlighter-rouge">process_request</code></td>
<td>Serve a request by thread</td>
</tr>
<tr>
<td>9 <code class="language-plaintext highlighter-rouge">queues_are_empty</code></td>
<td>Check queues</td>
<td>19 <code class="language-plaintext highlighter-rouge">listener</code></td>
<td>Thread for polling, repeatedly extract connection from thread group’s open file descriptor</td>
</tr>
<tr>
<td>10 <code class="language-plaintext highlighter-rouge">check_stall</code></td>
<td>Treat stalled threads</td>
<td>20 <code class="language-plaintext highlighter-rouge">queue_get</code></td>
<td>Extract a connection from queue</td>
</tr>
</tbody>
</table>
<h2 id="description-of-simulation-model">Description of simulation model</h2>
<p>Let’s list the input values for the model, which are produced by a random number generator with a given distribution:</p>
<ul>
<li>
<p>the input flow of connections: the distribution of time intervals between <code class="language-plaintext highlighter-rouge">add_connection()</code> calls;</p>
</li>
<li>
<p>the time of new thread creation: timing for <code class="language-plaintext highlighter-rouge">create_worker()</code>;</p>
</li>
<li>
<p>the duration of one active round for thread: the time from the start of request serving till the first <code class="language-plaintext highlighter-rouge">wait_begin()</code> call; or between <code class="language-plaintext highlighter-rouge">wait_end()</code> and <code class="language-plaintext highlighter-rouge">wait_begin()</code> calls; or between <code class="language-plaintext highlighter-rouge">wait_end()</code> call and request completion;</p>
</li>
<li>
<p>the duration of one off-CPU round for thread: the time between <code class="language-plaintext highlighter-rouge">wait_begin()</code> and <code class="language-plaintext highlighter-rouge">wait_end()</code> calls;</p>
</li>
<li>
<p>the number of active rounds during one request serving;</p>
</li>
<li>
<p>the time interval between request completion and the selection of the same persistent connection by polling, to assign it a new thread and start the next request.</p>
</li>
</ul>
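The model's stochastic inputs can be produced with standard generators. As an illustration, here is a generator of exponential inter-arrival times for the connection flow; the choice of distribution and the class name are ours, since the paper does not fix them here.

```cpp
#include <random>

// Generates inter-arrival times (in microseconds) between successive
// add_connection() calls. The exponential distribution models a Poisson
// arrival flow; the other inputs (active-round and off-CPU durations,
// thread-creation time) can be drawn the same way with their own
// parameters and distributions.
class ArrivalGenerator {
 public:
  ArrivalGenerator(double mean_interval_us, unsigned seed)
      : engine_(seed), dist_(1.0 / mean_interval_us) {}

  double next() { return dist_(engine_); }  // one sampled interval

 private:
  std::mt19937 engine_;
  std::exponential_distribution<double> dist_;
};
```

Averaged over many samples, `next()` returns values whose mean approaches `mean_interval_us`, so the offered load is easy to control.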
<p>The outputs of the model are the average number of served queries per second and the average latency of serving one request.
The model is built from the following classes: <em>Threadpool</em> (singleton), <em>Threadgroup</em>, <em>Connection</em>, <em>Thread</em>, <em>Timer</em> (singleton). The states of Thread instances are the following:</p>
<ul>
<li>
<p><em>Creating</em> – thread creation;</p>
</li>
<li>
<p><em>Active</em> – request serving;</p>
</li>
<li>
<p><em>Waiting</em> – input-output waiting;</p>
</li>
<li>
<p><em>Idle</em> – previous request is completed, but the next is not assigned yet;</p>
</li>
<li>
<p><em>Polling</em> – only one thread can be in this state at any moment. This thread is responsible for polling (it performs the <code class="language-plaintext highlighter-rouge">select()</code> API call) and is called the listener.</p>
</li>
</ul>
<p>States for Connection instances are the following:</p>
<ul>
<li>
<p><em>in usual queue</em> – connection is waiting for thread assignment in usual queue;</p>
</li>
<li>
<p><em>in prio queue</em> – connection is waiting for thread assignment in priority queue (if connection is related to already open transaction);</p>
</li>
<li>
<p><em>threading</em> – thread is assigned to connection, request is being served;</p>
</li>
<li>
<p><em>between</em> – the request is completed, and the connection is waiting for repeated extraction by the listener thread.</p>
</li>
</ul>
<p>The possible transitions are shown in Figures 2 and 3.</p>
<table>
<tbody>
<tr>
<td><a href="/assets/images/00bc1fc381529a5479cbbca3ea34ef7e.jpeg" title="Figure 2: State transitions for Thread class"><img src="/assets/images/00bc1fc381529a5479cbbca3ea34ef7e.jpeg" alt="00bc1fc381529a5479cbbca3ea34ef7e" /></a> Figure 2: State transitions for Thread class</td>
<td><a href="/assets/images/eced7dc616bf4b7f9e6e0fd869ff9b49.jpeg" title="Figure 3: State transitions for Connection class"><img src="/assets/images/eced7dc616bf4b7f9e6e0fd869ff9b49.jpeg" alt="eced7dc616bf4b7f9e6e0fd869ff9b49" /></a>Figure 3: State transitions for Connection class</td>
</tr>
</tbody>
</table>
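The two state machines map naturally onto enums with an explicit transition check. The sketch below encodes our reading of the Thread diagram (Figure 2); the exact edge set is an assumption and may need adjustment against the model's classes listed later in Table 2.

```cpp
#include <set>
#include <utility>

// States of a simulated Thread, as listed above.
enum class ThreadState { Creating, Active, Waiting, Idle, Polling };

// Allowed Thread transitions, following our reading of Figure 2.
bool thread_transition_ok(ThreadState from, ThreadState to) {
  static const std::set<std::pair<ThreadState, ThreadState>> ok = {
      {ThreadState::Creating, ThreadState::Active},
      {ThreadState::Active,   ThreadState::Waiting},
      {ThreadState::Waiting,  ThreadState::Active},
      {ThreadState::Active,   ThreadState::Idle},
      {ThreadState::Idle,     ThreadState::Active},
      {ThreadState::Idle,     ThreadState::Polling},
      {ThreadState::Polling,  ThreadState::Active},
  };
  return ok.count({from, to}) != 0;
}
```

The Connection states (`in usual queue`, `in prio queue`, `threading`, `between`) can be encoded the same way against Figure 3.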
<p>In addition to tp-size there are several parameters which we can play with to obtain greater performance:</p>
<ul>
<li>
<p><em>oversubscribe</em> – defines maximum number of active threads in one group;</p>
</li>
<li>
<p><em>timer_interval</em> – the time interval between activations of the <em>Timer</em> thread;</p>
</li>
<li>
<p><em>queue_put_limit</em> – wake or create thread in <code class="language-plaintext highlighter-rouge">queue_put()</code> if the number of active threads in group is less or equal to this parameter;</p>
</li>
<li>
<p><em>woct_top_limit</em> – create new thread in <code class="language-plaintext highlighter-rouge">wake_or_create_thread()</code> only if the number of active threads in the group is less or equal to this parameter;</p>
</li>
<li>
<p><em>create_thread_on_wait</em> – Boolean parameter defining whether a new thread will be created in <code class="language-plaintext highlighter-rouge">wait_begin()</code>;</p>
</li>
<li>
<p><em>idle_timeout</em> – maximum time thread can be in idle state;</p>
</li>
<li>
<p><em>listener_wake_limit</em> – listener wakes an idle thread if number of active threads is less or equal to this parameter;</p>
</li>
<li>
<p><em>listener_create_limit</em> – listener creates a new thread if the number of active threads is less or equal to this parameter.</p>
</li>
</ul>
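Collected in one place, the tunables above might look like the following configuration sketch. The default values shown are illustrative placeholders, not the ones used in the paper.

```cpp
#include <cstdint>

// Tunable parameters of the simulated thread pool, mirroring the list
// above. All default values are illustrative placeholders.
struct ThreadpoolConfig {
  int tp_size = 4;                     // number of thread groups
  int oversubscribe = 3;               // max active threads per group
  int64_t timer_interval_us = 1000;    // Timer thread activation period
  int queue_put_limit = 0;             // wake/create threshold in queue_put()
  int woct_top_limit = 0;              // create threshold in wake_or_create_thread()
  bool create_thread_on_wait = true;   // create a thread in wait_begin()?
  int64_t idle_timeout_us = 60000000;  // max time a thread stays idle
  int listener_wake_limit = 0;         // listener wakes an idle thread below this
  int listener_create_limit = 0;       // listener creates a thread below this
};
```

Keeping all knobs in one struct makes the static and dynamic parameter analysis mentioned in the abstract straightforward: a sweep is just a loop over config instances.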
<p>The model call graph is shown in Fig. 4.</p>
<figure class="align-center">
<a href="/assets/images/ad57d37ab6a4c7395bdc325e4798afe4.jpeg" title="Figure 4: Simulation model call graph">
<img src="/assets/images/ad57d37ab6a4c7395bdc325e4798afe4.jpeg" alt="" /></a>
<figcaption>Figure 4: Simulation model call graph</figcaption>
</figure>
<p>The titles of the methods are listed in Table 2. The main loop of the model is shown in Listing 1.</p>
<p style="text-align: center; font-size:0.7em;">Table 2. Classes and methods</p>
<table>
<thead>
<tr>
<th>Class::method</th>
<th>Class::method</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. <code class="language-plaintext highlighter-rouge">Threadpool::run</code></td>
<td>12. <code class="language-plaintext highlighter-rouge">Threadgroup::check_stall</code></td>
</tr>
<tr>
<td>2. <code class="language-plaintext highlighter-rouge">Timer::run</code></td>
<td>13. <code class="language-plaintext highlighter-rouge">Threadgroup::queue_put</code></td>
</tr>
<tr>
<td>3. <code class="language-plaintext highlighter-rouge">Threadpool::add_connection</code></td>
<td>14. <code class="language-plaintext highlighter-rouge">Threadgroup::queue_get</code></td>
</tr>
<tr>
<td>4. <code class="language-plaintext highlighter-rouge">Threadgroup::run</code></td>
<td>15. <code class="language-plaintext highlighter-rouge">Connection::to_threading</code></td>
</tr>
<tr>
<td>5. <code class="language-plaintext highlighter-rouge">Threadpool::check_stall</code></td>
<td>16. <code class="language-plaintext highlighter-rouge">Thread::to_polling</code></td>
</tr>
<tr>
<td>6. <code class="language-plaintext highlighter-rouge">Threadgroup::add_connection</code></td>
<td>17. <code class="language-plaintext highlighter-rouge">Threadgroup::get_connection_from_polling</code></td>
</tr>
<tr>
<td>7. <code class="language-plaintext highlighter-rouge">Threadgroup::assign_connection_to_thread</code></td>
<td>18. <code class="language-plaintext highlighter-rouge">Connection::to_usual_queue</code></td>
</tr>
<tr>
<td>8. <code class="language-plaintext highlighter-rouge">Thread::to_active</code></td>
<td>19. <code class="language-plaintext highlighter-rouge">Thread::to_idle</code></td>
</tr>
<tr>
<td>9. <code class="language-plaintext highlighter-rouge">Thread::to_waiting</code></td>
<td>20. <code class="language-plaintext highlighter-rouge">Connection::to_prio_queue</code></td>
</tr>
<tr>
<td>10. <code class="language-plaintext highlighter-rouge">Connection::to_between</code></td>
<td>21. <code class="language-plaintext highlighter-rouge">Threadgroup::wake_thread</code></td>
</tr>
<tr>
<td>11. <code class="language-plaintext highlighter-rouge">Threadgroup::listener</code></td>
<td>22. <code class="language-plaintext highlighter-rouge">Threadgroup::create_worker</code></td>
</tr>
</tbody>
</table>
<p style="text-align: center; font-size:0.7em;">Listing 1. The main simulation loop</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define NUMBER_OF_TACTS 60000000 </span><span class="cm">/*in mcs */</span><span class="cp">
</span><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">Threadpool</span> <span class="o">*</span><span class="n">tpl</span> <span class="o">=</span> <span class="n">Threadpool</span><span class="o">::</span><span class="n">getInstance</span><span class="p">();</span>
<span class="n">Timer</span> <span class="o">*</span><span class="n">tmr</span> <span class="o">=</span> <span class="n">Timer</span><span class="o">::</span><span class="n">getInstance</span><span class="p">();</span>
<span class="cm">/*initialize random number generator*/</span>
<span class="n">srand</span><span class="p">((</span><span class="kt">unsigned</span><span class="p">)</span><span class="n">time</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">long</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">NUMBER_OF_TACTS</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tpl</span><span class="o">-></span><span class="n">run</span><span class="p">();</span>
<span class="n">tmr</span><span class="o">-></span><span class="n">run</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">delete</span> <span class="n">tpl</span><span class="p">;</span>
<span class="k">delete</span> <span class="n">tmr</span><span class="p">;</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Now let’s consider how the model takes into account the time the CPU spends switching from one thread context to another. This phenomenon is known as thread contention, and it is the reason for the performance degradation observed once the number of groups exceeds some threshold. If we did not take it into account, we would be simulating not real life but something else, and our results would be worthless.
Let $N$ be the number of active threads, $M$ the number of CPU cores ($M < N$), $a$ the switching time (a model parameter), and $t$ the time of request serving (<em>request length</em> in terms of queueing theory). Then in one tick the model time advances for each request not by 1 but by $\frac{M}{N} - a$. So $N$ requests will be completed in time $\frac{tN}{M-aN}$, and performance equals $\frac{M-aN}{t}$ requests per unit of time. We can see that it indeed decreases as $N$ increases. Thus, we can formulate the following rule: if the condition</p>
\[M - aN > \frac{N}{\lceil\frac{N}{M}\rceil}\]
<p>is true, the residual length is reduced by $\frac{M}{N}-a$ for all $N$ active threads. Otherwise we act as follows: take arbitrary $M$ active threads out of the $N$ and decrement the residual length of each of them, leaving the residual length untouched for the remaining $N-M$ active threads. The second case means that the CPU time spent switching thread contexts is too long, so CPU sharing does not make sense. The ceiling brackets in the condition denote division rounded upward.</p>
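<p>For illustration, the per-tick contention rule above can be sketched in C++ as follows. This is a minimal sketch with names of our own choosing; it is not taken from the model’s actual source code.</p>

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One model tick under the contention rule described above.
// residual[i] is the remaining service time of active thread i,
// M is the number of CPU cores, a is the context-switch cost.
void tick(std::vector<double>& residual, int M, double a) {
    const int N = static_cast<int>(residual.size());
    if (N == 0) return;
    if (N <= M) {                         // no contention: full progress
        for (double& r : residual) r -= 1.0;
        return;
    }
    const double rounds = std::ceil(static_cast<double>(N) / M);
    if (M - a * N > N / rounds) {         // sharing is still worthwhile
        const double share = static_cast<double>(M) / N - a;
        for (double& r : residual) r -= share;
    } else {                              // contention too high: run only M threads
        for (int i = 0; i < M; ++i) residual[i] -= 1.0;
    }
}
```

<p>For example, with $M=4$, $a=0.1$ and $N=5$ the condition $4-0.5>5/2$ holds, so every thread progresses by $4/5-0.1=0.7$ per tick; with $N=8$ the condition $4-0.8>4$ fails, so only four threads are served per tick.</p>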
<h2 id="model-validation-and-results">Model validation and results</h2>
<p>The model was validated as follows. First, we obtained output results (average queries per second and latency) on a working MySQL server with the widely known testing utility <em>sysbench</em> <sup><a href="#9">[9]</a></sup>, written by one of the authors of this paper. At the same time, all measurements were logged during this experiment in order to build all the input distributions required by the simulation model. Then these distributions were fed to the simulation model, and its output results were compared with those of <em>sysbench</em>. The model showed a maximum divergence of no more than 2% over the whole sequence of experiments. In this section we focus on comparing the results for CPU-bound and IO-bound workloads. Here are examples of the differences in input data.
Figures 5 and 6 show histograms of CPU-active round latency in microseconds; the sample size is 1000. In other words, this is the timing of the <em>active</em> state of <em>thread</em> instances in terms of our model. Figures 7 and 8 show histograms of off-CPU round latency in microseconds, which is the timing of the <em>waiting</em> state of <em>thread</em> instances. Data for the CPU-bound workload were collected with 1024 concurrent connections, and data for the IO-bound workload with 128. We can see that the off-CPU round for the IO-bound workload is much longer than for the CPU-bound one, because the rate of IO actions is higher. This means that a tp-size greater than the number of CPUs can yield a noticeable performance gain. Data for figures 9 and 10 were collected with the model. The figures show how thread pool performance depends on the number of thread groups. The duration of the simulation is 60 million ticks (mcs), and the number of CPUs is 72. The plots correspond well to the known patterns classified, for example, in <sup><a href="#3">[3]</a></sup>. We can see that tp-size > 72 gives nothing for the CPU-bound workload, because the CPUs are fully busy anyway. That is why the main goal of the model here is not so much to increase performance as to minimize tp-size. The situation is different for the IO-bound workload: performance continues to increase even beyond tp-size = 72, reaching its maximum near tp-size = 180, and then starts to decrease due to thread contention. This is the typical case for IO-bound workloads with many concurrent connections.</p>
<table>
<tbody>
<tr>
<td><a href="/assets/images/ro_active.jpeg" title="Figure 5: CPU-active round latency for CPU-bound"><img src="/assets/images/ro_active.jpeg" alt="ro_active" /></a> Figure 5: CPU-active round latency for CPU-bound</td>
<td><a href="/assets/images/rw_active.jpeg" title="Figure 6: CPU-active round latency for IO-bound"><img src="/assets/images/rw_active.jpeg" alt="rw_active" /></a> Figure 6: CPU-active round latency for IO-bound</td>
</tr>
<tr>
<td><a href="/assets/images/bc7538a9deb0f6f5848ad21777e22056.jpeg" title="Figure 7: off-CPU round latency for CPU-bound"><img src="/assets/images/bc7538a9deb0f6f5848ad21777e22056.jpeg" alt="bc7538a9deb0f6f5848ad21777e22056" /></a> Figure 7: off-CPU round latency for CPU-bound</td>
<td><a href="/assets/images/e5f401368f629f29a8493a88bc00b5b5.jpeg" title="Figure 8: off-CPU round latency for IO-bound"><img src="/assets/images/e5f401368f629f29a8493a88bc00b5b5.jpeg" alt="e5f401368f629f29a8493a88bc00b5b5" /></a> Figure 8: off-CPU round latency for IO-bound</td>
</tr>
<tr>
<td><a href="/assets/images/ps_64_excellent_example.png" title="Figure 9: “tp-size – throughput” for CPU-bound"><img src="/assets/images/ps_64_excellent_example.png" alt="ps_64_excellent_example" /></a> Figure 9: “tp-size – throughput” for CPU-bound</td>
<td><a href="/assets/images/immod_var.png" title="Figure 10: “tp-size – throughput” for IO-bound"><img src="/assets/images/immod_var.png" alt="immod_var" /></a> Figure 10: “tp-size – throughput” for IO-bound</td>
</tr>
</tbody>
</table>
<h2 id="conclusions">Conclusions</h2>
<p>Let us formulate where the thread pool simulation model could be applied:</p>
<ul>
<li>
<p>to reveal the parameters and local algorithmic decisions to which thread pool performance is most sensitive and to suggest server tuning recommendations, which could be useful for software engineers and DBAs;</p>
</li>
<li>
<p>to reveal dependencies of model output on input distributions;</p>
</li>
<li>
<p>to find optimal parameter values and to reveal their dependence on quantitative and qualitative indicators of the server workload;</p>
</li>
<li>
<p>dynamic optimization: to collect and process statistics on a working DBMS server and then run the model for a quick search of the optimal tp-size and other important parameters.</p>
</li>
</ul>
<p>And finally, a few words about the ML approach in databases and how simulation can help. This approach has been actively investigated in recent years <sup><a href="#23">[23]</a></sup>. The situation is as follows: some software module (for example, the thread pool) is configured by several parameters chosen by developers or the DBA. Both the set of these parameters as a whole and each of them individually affect the output, which is some performance measure. Thus, some tuple of parameter values gives the maximum performance, and the goal is to find that optimal tuple.
The optimal tuple varies over a wide range depending on factors such as:</p>
<ul>
<li>
<p>the server’s hardware configuration (number and types of CPUs, amount of RAM and swap partition, etc.); the operating system and its job scheduling algorithms;</p>
</li>
<li>
<p>the current load from clients in terms of quantity (number of simultaneous connections);</p>
</li>
<li>
<p>the current load from clients in terms of type (distribution of request lengths and of the resources consumed, such as CPU and disk).</p>
</li>
</ul>
<p>The simulation model can be applied to find the optimal tuple. The optimal tuple found corresponds to some fixed workload profile on the given server. If we describe this profile more or less completely, we can expect that the next time the profile is approximately the same, we already know the optimal tuple. This idea is the cornerstone of the proposed approach. The general plan is:</p>
<ul>
<li>
<p>to profile the DBMS code appropriately and collect workload input data;</p>
</li>
<li>
<p>to find the optimal tuple for the collected input data using the simulation model;</p>
</li>
<li>
<p>to put the found optimal tuple into correspondence with the collected data, thus obtaining a new record of the training dataset;</p>
</li>
<li>
<p>when the training dataset becomes large enough, to train our ML model on it and then apply this model on a working DBMS server.</p>
</li>
</ul>
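<p>The second step of this plan, searching for the optimal tp-size with the model as a black box, might look as follows. This is a hypothetical sketch: <code class="language-plaintext highlighter-rouge">simulate_throughput</code> is a crude analytical stand-in built from the contention formula $\frac{M-aN}{t}$ above, not the real simulation model, and all names are ours.</p>

```cpp
#include <cassert>

// Crude stand-in for the simulation model: below core saturation the
// throughput scales with tp_size; above it the contention term a*tp_size
// from the formula (M - aN)/t takes over and throughput degrades.
double simulate_throughput(int tp_size, int cpus, double a, double t) {
    const double perf = (cpus - a * tp_size) / t;
    if (perf <= 0.0) return 0.0;
    return tp_size < cpus ? perf * tp_size / cpus : perf;
}

// Grid search over tp-size; the pair (workload profile, returned value)
// would become one record of the training dataset.
int find_optimal_tp_size(int cpus, double a, double t, int max_size) {
    int best = 1;
    double best_perf = 0.0;
    for (int s = 1; s <= max_size; ++s) {
        const double p = simulate_throughput(s, cpus, a, t);
        if (p > best_perf) { best_perf = p; best = s; }
    }
    return best;
}
```

<p>With 72 CPUs this sketch finds the optimum at tp-size = 72, matching the CPU-bound pattern of figure 9 where a tp-size above the number of CPUs gives nothing.</p>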
<p>According to <sup><a href="#16">[16]</a></sup>, we can perform ML procedures directly in the database by means of the proposed SQL-language extension, so there is no need to call an external ML tool after extracting the training dataset from the database. Once implemented, the results of that work should significantly improve the efficiency of the described approach.</p>
<p>Results of applying the simulation model will be published in subsequent articles.</p>
<h2 id="references">References</h2>
<p><a name="1">1</a>: Better performance through threading. – URL: <a href="https://developer.android.com/topic/performance/threads">https://developer.android.com/topic/performance/threads</a></p>
<p><a name="2">2</a>: F. S. Boer, I. Grabe, M. M. Jaghoori, A. Stam, W. Yi, Modeling and Analysis of Thread-Pools in an Industrial Communication Platform. – ICFEM’09: Proceedings of the 11-th International Conference on Formal Engineering Methods, November 2009, pp.367-386. doi: 10.1007/978-3-642-10373-5_19.</p>
<p><a name="3">3</a>: X. Dongping, Performance study and dynamic optimization design for threadpool systems (2004). – URL: <a href="https://digital.library.unt.edu/ark:/67531/metadc780878/m2/1/high_red_d/85380.pdf">https://digital.library.unt.edu/ark:/67531/metadc780878/m2/1/high_red_d/85380.pdf</a></p>
<p><a name="4">4</a>: E. Fuentes, Concurrency – Throttling Concurrency in the CLR 4.0 Threadpool (September 2010). – URL: <a href="https://docs.microsoft.com/en-us/archive/msdn-magazine/2010/September/concurrency-throttling-concurrency-in-the-clr-4-0-threadpool">https://docs.microsoft.com/en-us/archive/msdn-magazine/2010/September/concurrency-throttling-concurrency-in-the-clr-4-0-threadpool</a></p>
<p><a name="5">5</a>: J. L. Hellerstein, V. Morrison, E. Eilebrecht, Applying Control Theory in the Real World. – ACM’SIGMETRICS Performance Evaluation Rev., Volume 37, Issue 3, 2009, pp.38-42. doi: 10.1145/1710115.1710123.</p>
<p><a name="6">6</a>: J. L. Hellerstein, V. Morrison, E. Eilebrecht, Optimizing Concurrency Levels in the .NET Threadpool. – FeBID Workshop 2008, Annapolis, MD USA.</p>
<p><a name="7">7</a>: A. Ilinchik, How to set an ideal thread pool size (April 2019). – URL: <a href="https://engineering.zalando.com/posts/2019/04/how-to-set-an-ideal-thread-pool-size.html">https://engineering.zalando.com/posts/2019/04/how-to-set-an-ideal-thread-pool-size.html</a></p>
<p><a name="8">8</a>: Java Concurrency in lock optimization and optimization thread pool. – URL: <a href="https://programmersought.com/article/84012626442">https://programmersought.com/article/84012626442</a></p>
<p><a name="9">9</a>: A. Mughees, How to benchmark performance of MySQL using Sysbench (June 2020). – URL: <a href="https://ittutorial.org/how-to-benchmark-performance-of-mysql-using-sysbench">https://ittutorial.org/how-to-benchmark-performance-of-mysql-using-sysbench</a></p>
<p><a name="10">10</a>: S. Nazeer, F. Bahadur, Prediction and Frequency Based Dynamic Thread Pool System. – International Journal of Computer Science and Information Security, Vol. 14, No. 5, May 2016, pp.299-308.</p>
<p><a name="11">11</a>: Oracle GlassFish Server 3.1 Performance Tuning Guide. – URL: <a href="https://docs.oracle.com/cd/E18930_01/pdf/821-2431.pdf">https://docs.oracle.com/cd/E18930_01/pdf/821-2431.pdf</a></p>
<p><a name="12">12</a>: K. Pepperdine, Tuning the Size of Your Thread Pool (May, 2013). – URL: <a href="https://infoq.com/articles/Java-Thread-Pool-Performance-Tuning">https://infoq.com/articles/Java-Thread-Pool-Performance-Tuning</a></p>
<p><a name="13">13</a>: Percona Server for MySQL: Thread Pool. – URL: <a href="https://www.percona.com/doc/percona-server/5.7/performance/threadpool.html">https://www.percona.com/doc/percona-server/5.7/performance/threadpool.html</a></p>
<p><a name="14">14</a>: I. Pyarali, M. Spivak, R. Cytron, Evaluating and Optimizing Thread Pool Strategies for Real-Time CORBA. – ACM’SIGPLAN Notices, Volume 36, Issue 8, August 2001, pp. 214-222. doi:10.1145/384198.384226.</p>
<p><a name="15">15</a>: Ronstrom M. MySQL Thread Pool: Summary (October 2011). – URL: <a href="https://mikaelronstrom.blogspot.com/2011/10/mysql-thread-pool-summary.html">https://mikaelronstrom.blogspot.com/2011/10/mysql-thread-pool-summary.html</a></p>
<p><a name="16">16</a>: M. Schüle, F. Simonis, T. Heyenbrock, A. Kemper, S. Günnemann, T. Neumann, In-Database Machine Learning: Gradient Descent and Tensor Algebra for Main Memory Database Systems. - In: Grust, T., Naumann, F., Böhm, A., Lehner, W., Härder, T., Rahm, E., Heuer, A., Klettke, M. & Meyer, H. (Hrsg.), BTW 2019. Gesellschaft für Informatik, Bonn. pp. 247-266. – URL: <a href="https://dl.gi.de/bitstream/handle/20.500.12116/21700/B6-1.pdf?sequence=1&isAllowed=y">https://dl.gi.de/bitstream/handle/20.500.12116/21700/B6-1.pdf?sequence=1&isAllowed=y</a> doi: 10.18420/btw2019-16.</p>
<p><a name="17">17</a>: I. Stetsenko, O. Dyfuchyna, Thread Pool Parameters Tuning Using Simulation. – In book: Advances in Computer Science for Engineering and Education II (editor Hu Z.), Springer 2020, pp.78-89. doi: 10.1007/978-3-030-16621-2_8</p>
<p><a name="18">18</a>: R. Terrell Concurrency in .NET: Modern patterns of concurrent and parallel programming. – Simon and Schuster Publishing House, 2018, 568 pp.</p>
<p><a name="19">19</a>: J. Timm, An OS-level adaptive thread pool scheme for I/O-heavy workloads. – Master thesis, Delft University of Technology, 2021. URL: <a href="https://repository.tudelft.nl/islandora/object/uuid%3A5c9b4c42-8fdc-4170-b978-f80cd8f00753">https://repository.tudelft.nl/islandora/object/uuid%3A5c9b4c42-8fdc-4170-b978-f80cd8f00753</a></p>
<p><a name="20">20</a>: Thread Pool in Maria DB. URL: <a href="https://mariadb.com/kb/en/thread-pool-in-mariadb">https://mariadb.com/kb/en/thread-pool-in-mariadb</a></p>
<p><a name="21">21</a>: M. Warren, The CLR Thread Pool ‘Thread Injection’ Algorithm (April 2017). – URL: <a href="https://codeproject.com/Articles/1182012/The-CLR-Thread-Pool-Thread-Injection-Algorithm">https://codeproject.com/Articles/1182012/The-CLR-Thread-Pool-Thread-Injection-Algorithm</a></p>
<p><a name="22">22</a>: What is the ideal Thread Pool Size – Java Concurrency. – URL: <a href="https://techblogstation.com/java/thread-pool-size">https://techblogstation.com/java/thread-pool-size</a></p>
<p><a name="23">23</a>: X. Zhou, J. Sun, Database Meets Artificial Intelligence. – IEEE Transactions on Knowledge and Data Engineering, May 2020. doi: 10.1109/TKDE.2020.2994641.</p>
<p><a name="24">24</a>: URL: <a href="https://github.com/dotnet/coreclr/blob/master/src/vm/win32threadpool.cpp">https://github.com/dotnet/coreclr/blob/master/src/vm/win32threadpool.cpp</a></p>
<p><a name="25">25</a>: M. S. Akopyan, Using multithreaded processes in ParJava environment. - Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2015;27:2 (In Russian). – URL: <a href="http://www.mathnet.ru/links/a7d9a523f1eb29a3745bd7209c3765aa/tisp119.pdf">http://www.mathnet.ru/links/a7d9a523f1eb29a3745bd7209c3765aa/tisp119.pdf</a> doi: 10.15514/ISPRAS-2015-27(2)-1.</p>
<p><a name="26">26</a>: V. A. Klyachin, Parallel Algorithm of Geometrical Hashing Based on NumPy Package and Processes Pool. – Vestnik Volgogradskogo Universiteta, seriya 1, Mat.-Fiz., 2015, issue 4 (29), pp. 13-23, (in Russian). – URL: <a href="https://www.mathnet.ru/links/465bab7745fcdb80f25de7c0f18b0a07/vvgum71.pdf">https://www.mathnet.ru/links/465bab7745fcdb80f25de7c0f18b0a07/vvgum71.pdf</a> doi: 10.15688/jvolsul.2015.42.</p>
<p><a name="27">27</a>: I. I. Trub, Object-oriented simulation on C++. – Piter Publishing House, 2005. – 416 p. (in Russian). – URL: <a href="https://inftechgroup.ucoz.com/load/knigi_po_programmirovaniju/obektno_orientirovannoe_programmirovanie/obektno_orientirovannoe_modelirovanie_na_c/2-1-0-43">https://inftechgroup.ucoz.com/load/knigi_po_programmirovaniju/obektno_orientirovannoe_programmirovanie/obektno_orientirovannoe_modelirovanie_na_c/2-1-0-43</a> ISBN: 5-469-00893-2.</p>
<p>Ilya Trub</p>
<p>This is a blog post version of our paper “Simulation of thread pool for database server” that will be published in CEUR Workshop Proceedings. In the article, we consider an object-oriented simulation model for a thread pool. The implementation of the thread pool in MariaDB and Percona Server was taken as a basis. The model’s input flows and their distributions are described. The model’s output results are consistent with known “concurrency level – throughput” dependency patterns for IO- and CPU-bound workloads. The model is written in C++, and its software architecture is also considered, including the provided classes, methods, and call graph. The model takes into account the “thread contention” phenomenon, and mathematical expressions for it were proposed. The model has practical value as an effective tool for static and dynamic analysis of the most significant parameters affecting performance and for the optimal choice of these parameters.</p>