MonetDB requires all data that needs to be active at any given point in time to fit into the address space --- and, of course, to fit on the storage device (i.e., your disk system). On 32-bit systems, the address space is at most 32-bit (4 GB); in practice, it is limited to 3 GB or even 2 GB on most systems. On 64-bit systems, the address space can theoretically be 64-bit, but in practice it is often "limited" to 48-bit or so --- not that that makes any difference ...
MonetDB makes intensive use of main memory for processing, but does not require that all data fit in the available physical memory. To handle a dataset that exceeds the available physical memory, MonetDB does not (only) rely on the available swap space, but (also) uses memory-mapped files to exploit disk storage beyond the swap space as virtual memory.
For example, while bulk-loading data (preferably using a COPY INTO statement from a (possibly compressed) CSV file), MonetDB needs to have all columns of the table being loaded "active", i.e., accessible in the address space. However, during loading, parts of the data are continuously written to the persistent files on disk, i.e., the whole table does not have to fit into main memory. For instance, loading a 100 GB table works fine on a system with 8 GB RAM and 16 GB swap --- provided there is sufficient free disk space.
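As a purely illustrative example, assuming a database mydb with an existing table t and a gzip-compressed CSV file /data/t.csv.gz readable by the server (all of these names are made up), such a bulk load could be started with:

mclient -d mydb -s "COPY INTO t FROM '/data/t.csv.gz' USING DELIMITERS ',','\n'"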
During query processing, MonetDB requires that, for each individual MAL operation, all its inputs, outputs, and possible temporary data structures fit in the address space. This is not a problem on 64-bit machines, but may become noticeable when you are close to the limit on 32-bit machines. MonetDB automatically resorts to virtual memory and memory-mapped files for large intermediate results. (Large) persistent tables are also accessed using memory mapping.
While running, you might see your mserver5 process' virtual size grow well beyond the available physical memory and possibly also well beyond your swap space. In principle, this is not a problem at all. Most of this virtual size is due to the fact that large base tables (or intermediate results) that reside as files on disk are memory-mapped into the address space. Those parts of the data that are currently not accessed do not consume any physical memory (except possibly for caching purposes).
However, if individual columns of your table(s) and/or individual columns of intermediate results exceed the size of the available physical memory, the performance of MonetDB will degrade due to the increased I/O this requires.
MonetDB is well-known for its eagerness to "eat up" all your memory. Fortunately, on Linux there is a resource-management facility called cgroups, whose memory controller can be used to rein in that behaviour a bit. Let me share some experiences of running MonetDB under cgroups here.
Perhaps it is a lesser-known fact that Docker, LXC and other container technologies are implemented using Linux namespace isolation and Linux control groups, aka cgroups. The latter is a Linux kernel feature introduced in 2008. The main idea behind cgroups is to manage hardware and operating-system resources for different groups of processes. Let's say, for example, that we want to run two database servers on a single host that has 64 GB of RAM. If we want each database server to have guaranteed access to half of this memory capacity, we can set up two memory cgroups, group_1 and group_2, put a memory limit of 32 GB on each of them, and associate the database server processes with their respective cgroup.
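As a minimal sketch (assuming a cgroups v1 memory hierarchy mounted at /sys/fs/cgroup/memory, and two already running server processes with the made-up PIDs 1234 and 5678), this could look roughly as follows:

cd /sys/fs/cgroup/memory
sudo mkdir group_1 group_2
echo 32G | sudo tee group_1/memory.limit_in_bytes
echo 32G | sudo tee group_2/memory.limit_in_bytes
echo 1234 | sudo tee group_1/tasks    # PID of the first database server (made up)
echo 5678 | sudo tee group_2/tasks    # PID of the second database server (made up)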
Besides the memory controller there are other subsystems that are useful for database systems (although limiting the CPU usage of MonetDB is much easier). For example, the cpuset cgroup subsystem assigns specific CPUs to specific processes; it can be used to divide the total number of threads equally among multiple database servers, as sketched below. There are further cgroup subsystems that deal with labelling and prioritising network packets, (block) devices, snapshotting, CPU usage accounting, etc.
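A hypothetical cpuset sketch (cgroups v1, assuming the hierarchy is mounted at /sys/fs/cgroup/cpuset and we want to pin one server to CPUs 0-7 on NUMA node 0):

cd /sys/fs/cgroup/cpuset
sudo mkdir db_1
echo 0-7 | sudo tee db_1/cpuset.cpus    # CPUs this group may use (assumed: 0-7)
echo 0 | sudo tee db_1/cpuset.mems      # NUMA memory node(s); must be set before adding tasks
pgrep mserver5 | sudo tee db_1/tasks    # assumes a single running mserver5 process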
So why would you choose a direct cgroups approach instead of a managed container layer like Docker? Well, I can imagine that in enterprise production environments, (overly) strict security policies prevent the use of third-party tools outside the tools that come with the OS distribution and some well-defined software inventory. Another reason could be that, for simple use cases, a manual cgroups configuration offers a more lightweight, transparent and dynamic way of implementing resource management. And if you need high customisability, cgroups again offers the most fine-grained access to resource management on Linux systems.
To keep it simple, we will only focus on the memory controller subsystem here, and on how to get things done on my Ubuntu 18.04 (Bionic) machine, which runs Linux kernel version 4.15. Furthermore, I am using cgroups version 1. There is also a more recent version 2 of cgroups, which has a different architecture.
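If you are unsure which version your system uses, inspecting the mounted cgroup file systems gives a quick indication ("cgroup" entries correspond to v1 hierarchies, a "cgroup2" entry to v2):

mount | grep cgroup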
Let's say that our MonetDB user is running mserver5 locally on his/her desktop, and wants to prevent MonetDB from eating all of the machine's 4 GB of RAM. So the user decides to cap mserver5's physical memory usage at 2 GB.
My Ubuntu machine already comes with a couple of cgroup hierarchies for specific tasks. These hierarchies are just directory trees, with the special "cgroup" file system mounted at the root of each tree. The hierarchy that we are interested in is found at /sys/fs/cgroup/memory, which is the preconfigured root of a cgroup hierarchy that controls memory limits. There we can do something like:
cd /sys/fs/cgroup/memory
sudo mkdir monetdb-cgroup
cd monetdb-cgroup
After the creation of the cgroup, notice that it is already filled with special files:
ls -l
-rw-r--r-- 1 root root 0 Oct 30 11:19 cgroup.clone_children
--w--w--w- 1 root root 0 Oct 30 11:19 cgroup.event_control
-rw-r--r-- 1 root root 0 Oct 30 11:19 cgroup.procs
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.failcnt
--w------- 1 root root 0 Oct 30 11:19 memory.force_empty
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.numa_stat
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.oom_control
---------- 1 root root 0 Oct 30 11:19 memory.pressure_level
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.stat
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.swappiness
-r--r--r-- 1 root root 0 Oct 30 11:19 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Oct 30 11:19 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Oct 30 11:19 notify_on_release
-rw-r--r-- 1 root root 0 Oct 30 11:19 tasks
To put a limit on the memory usage, run:
sudo $(echo 2G > memory.limit_in_bytes)
Assuming that a single MonetDB instance was already running, we can then simply add the corresponding mserver5 process to our newly created monetdb-cgroup:
sudo $(pgrep mserver5 > tasks)
Voilà! The mserver5 process's private physical memory usage is now bound to 2 GB by the above commands. Even if mserver5 was using more than 2 GB of memory at the time of its cgroup association, pages are swapped out until the process uses less than its cgroup physical-memory limit. This even works for sharable memory: as long as only a single process uses the potentially sharable memory pages (e.g. pages associated with memory-mapped files), those pages are considered by the kernel to be private and charged to the process that uses them (the most). You can easily see the difference in memory usage using something like the top command.
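Alternatively, the cgroup's own accounting files (both visible in the listing above) show how much memory is currently charged to the group and what the configured limit is:

cat /sys/fs/cgroup/memory/monetdb-cgroup/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/monetdb-cgroup/memory.limit_in_bytes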
I have to put up a disclaimer that this is the desired behaviour according to the official Linux kernel documentation, and it is what I observed on the aforementioned Ubuntu system. Before this successful attempt, I had done a similar experiment on a slightly older Ubuntu distribution (16.04 Xenial with kernel version 4.4, I believe). On that setup, the cgroup limit did not correctly limit the memory usage of MonetDB, because somehow it was not charging the sharable memory to the private memory of the database server process. I have no clue why this is the case, and I suspect a bug of some sort in that older kernel.
In case your system doesn't have a predefined cgroup hierarchy for configuring the memory controller, the steps to create one are pretty straightforward and are outlined in the above-mentioned kernel documentation.
I have noticed that the directory that contains the roots of the cgroup hierarchies always has the tmpfs file system mounted on it. I have not found a reason why this is the case, or whether it also works with another/inherited file system.
sudo mkdir /sys/fs/cgroup    # Doesn't necessarily have to be this path
sudo mount -t tmpfs none /sys/fs/cgroup
cd /sys/fs/cgroup
sudo mkdir memory
sudo mount -t cgroup none /sys/fs/cgroup/memory -o memory
To configure cgroups, you need root privileges, either because you need to use the mount command or because predefined cgroups like /sys/fs/cgroup/memory are owned by root and have write restrictions. However, after you have configured a (parent) cgroup, you can change the ownership of that cgroup to a non-root user using chown.
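A minimal sketch, assuming the monetdb-cgroup created earlier and a hypothetical non-root user called monetdb that should manage it:

sudo chown -R monetdb:monetdb /sys/fs/cgroup/memory/monetdb-cgroup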
Be aware that with cgroups version 1, only one hierarchy can exist for any unique set of cgroup subsystems. So if we were to create another hierarchy, say /sys/fs/cgroup/another-memory, that has only the memory controller as a subsystem (-o memory), then the corresponding mount command for another-memory would fail, since there is already an existing hierarchy with the exact same subsystem(s). But if we were to mount it with two subsystems, say the memory AND cpuset controllers (-o cpuset,memory), the mount should succeed.
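A sketch of the failing case (the memory controller is already bound to the existing /sys/fs/cgroup/memory hierarchy):

sudo mkdir /sys/fs/cgroup/another-memory
sudo mount -t cgroup none /sys/fs/cgroup/another-memory -o memory    # fails: memory controller already in use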
Sometimes when MonetDB hits the (memory) limit of its cgroup, it might get killed by the OOM killer. Triggering the OOM killer seems to be one strategy used by cgroups to enforce the memory limitation on its processes (see also [1] and [2]). To prevent MonetDB from being killed by the OOM killer, one can set "oom_score_adj" to a very low value, e.g. -1000. Valid values of "oom_score_adj" are integers in the range -1000 to 1000; the lower the value, the lower the chance that the process is going to be killed (see also [3]).
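A sketch of how this could be done for a running server (assuming a single mserver5 process):

echo -1000 | sudo tee /proc/$(pgrep mserver5)/oom_score_adj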
I tried out the above instructions on a freshly installed Ubuntu 20.04 and stumbled upon several problems. Here they are, together with their solutions.
First, on Ubuntu 20.04 the commands "sudo $(echo 2G > memory.limit_in_bytes)" and "sudo $(pgrep mserver5 > tasks)" might not work. A workaround is to first become root (i.e. "sudo su"), and then execute those commands without "sudo".
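In full, the workaround could look like this (assuming the monetdb-cgroup created earlier):

sudo su
cd /sys/fs/cgroup/memory/monetdb-cgroup
echo 2G > memory.limit_in_bytes
pgrep mserver5 > tasks
exit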
Second, if you want to set "memory.memsw.limit_in_bytes", but it is not in the "monetdb-cgroup" directory just created, you can try the following:
STEP 0. BEFORE:
0.1. BOOT_IMAGE doesn't have the expected values:
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-28-generic root=UUID=aa167f0b-53ed-4449-b157-15238f0e5644 ro quiet
0.2. GRUB_CMDLINE_LINUX doesn't have the expected values:
# grep GRUB_CMDLINE_LINUX /etc/default/grub
GRUB_CMDLINE_LINUX=""
STEP 1. Change GRUB_CMDLINE_LINUX into the following:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
STEP 2. Make the grub config file definitive:
# sudo update-grub
STEP 3. Restart the machine.
STEP 4. Check that we have the expected values:
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-28-generic root=UUID=aa167f0b-53ed-4449-b157-15238f0e5644 ro cgroup_enable=memory swapaccount=1 quiet
Now you should be able to set memory.memsw.limit_in_bytes.
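For example (the 3 GB value here is just an illustration; memory.memsw.limit_in_bytes must be at least as large as memory.limit_in_bytes):

echo 3G | sudo tee /sys/fs/cgroup/memory/monetdb-cgroup/memory.memsw.limit_in_bytes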
[1] https://stackoverflow.com/questions/27439910/why-cgroup-s-memory-subsystem-use-oom-killer-instead-of-return-memory-allocation
[2] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-memory
[3] https://askubuntu.com/questions/60672/how-do-i-use-oom-score-adj