Advanced Features

Advanced Features mk Tue, 02/18/2020 - 14:06

MonetDB has been designed to use the available resources without many tuning knobs. Functional enhancements are collected in autoloading scripts, which are executed when a database is created.

Other resources are mostly gathered upon need from the operating system environment. MonetDB aims to use as much of the main memory available, as many cores as can be practically deployed in parallel processing of queries, and trying to avoid going to a slow disk. As such, it is a rather aggressive system, which find immediate competition with ongoing heavy demand applications, e.g. a browser with a ton of open windows.

We recommend to use MonetDB in production settings on a dedicated server or a sizeable Cloud server.

Autoloading Scripts

Autoloading Scripts mk Mon, 05/04/2020 - 14:08

Autoloading  is used to initialize a new SQL catalog with scripts containing tables, functions, and procedures while creating a new database with monetdb. Such scripts are kept in the MonetDB library area ./lib/monetdb5/createdb/ . Look them up for the details. The prefix number is used to direct the order of their execution.  A summary of the current collection is illustrated below

Script Description
09_like.sql String regular pattern matching using PCRE
10_math.sql Polar arithmetic
12_url.sql URL manipulations (see manual)
13_date.sql POSIX formatted date coercions ((see man strptime/strftime)
14_inet.sql PostgreSQL inet data type (see manual)
15_querylog.sql Query history management (see manual)
16_tracelog.sql Query trace table (see manual)
17_temporal.sql Additional date, time and timestamp functions
18_index.sql Experimental Ordered index functions
20_vacuum.sql Table vacuum commands (experimental!)
21_dependency_views.sql Schema object dependencies analysis
22_clients.sql Client administration procedures
23_skyserver.sql Application support library
25_debug.sql Debugging tools for optimziers
26_sysmon.sql Active SQL query queue management
27_rejects.sql COPY into reject management
Analytic functions
40_geom.sql Optional Geometry extension library (see manual)
JSON data type support functions
41_md5sum.sql md5 checksum function
45_uuid.sql UUID data type functionality
46_profiler.sql start/stop collecting MonetDB profiler traces in a profiler_logs pool with a specific heartbeat
51_sys_schema_extension.sql Additional tables / views in "sys" schema.
60_wlcr.sql Workload Capture and Replay functions.
72_fits.sql Optional Flexible Image Transport System (FITS) files extension library
(see astronomical-data-analysis-monetdb-data-vaults)
74_netcdf.sql Optional Network Common Data Form (NetCDF) files extension library
(see Extensions/DataVaults)
75_lidar.sql Optional LiDAR data files extension library
75_shp.sql Optional SHP files extension library
75_storagemodel.sql Storage footprint analysis and prediction.
80_statistics.sql Analyze tables for optimizer statistics
Example C UDF functions.
85_bam.sql Optional BAM/SAM (Sequence Alignment/Map) format files extension library
(see BAM/SAM module)
Table producing functions to generate series of numbers or timestamps
99_system.sql System administration (internal)

Along the same line, MonetDB library area ./lib/monetdb5/autoload/ contains MAL scripts to be executed each time the server is started.

Data Compression

Data Compression mk Mon, 05/04/2020 - 14:43

A distinctive feature of column stores is to apply aggressive data compression. However, compression is often a two-edged sword, where movement of large data files over relative slow networks, disk access or memory interconnects is compensated for by applying CPU cycles. The effectiveness of which strongly depends on the mapping from database schema into data structures,  their high maintenance cost, the relational algorithms,  the system architecture, and the data distribution. The compression ratios cited  depends on the input size, which is commonly assumed to be in CSV format, the data distribution, and the database storage footprint, with or without auxiliary data structures like indices.

MonetDB applies different compression techniques automatically at many levels.

The column store representation is highly optimized, where the basic storage structure is a dense array, i.e. without holes to accommodate future insertions or overhead caused by the data structure itself (e.g. B-trees). This dense representation allows for direct mapping of the database files into memory. The storage width ranges from 1 (byte) to 8 bytes (doubles).  NULL values are part of the domain space, which avoids auxiliary bit masks at the expensive of 'loosing' a single value from the domain.

All strings are stored using dictionary encoding. This significantly reduces their storage space, but with larger dictionaries the maintenance cost may become expensive. Therefore for really large dictionary tables, MonetDB resort to non-compressed string representation. The references into the dictionary table occupy anywhere from 1 to 8 bytes, depending on the number of elements.

During query evaluation, a dense range of results is represented by a column view. This is a small footprint representation of the result set. It avoids both copying the result and storing it in its private column structure.

Disk Space

Disk Space mk Mon, 05/04/2020 - 14:44

The disk space footprint is determined by the way columns are being stored. MonetDB uses dictionary encoding for string columns, but aside from this there is no default compression applied to reduce the disk footprint. The prime reason being the two-headed sword of compression. It saves (cheap) disk space and IO bandwidth at the cost of expensive CPU (de)compression overhead (See compression). Since all columns are memory mapped upon access, i.e. they need not be decompressed.  If disk space comes at a premium and memory residency can be guaranteed for a long time, then a compressed file system, e.g. BTRFS, may become helpful. Its compression behavior is often as good as  dedicated algorithms executed within the critical path of a query execution.

The disk footprint can be assessed using the (Linux) command 'du' on the dbfarm directory or to run the query 'select * from storage();', provided the sql extensions are pre-loaded into your database. (See storage model)

Running out of diskspace

One of the features of MonetDB's execution model is that all intermediates are materialized as memory mapped files. A consequence of this approach is that when memory is too small to keep them around, they will be swapped to disk by the operating system. This can be seen as a decaying free space and ultimately a full disk. In turn this (should) lead to a single transaction abort and removing its disk claim. Evidently, filling your disk depends on the number of users and the complexity of their queries. It aids to the cost of running COPY INTO and queries concurrently.

If you have limited resources for concurrent access then the monetdb funnel may be an option to serialize the user requests.  A single server can have multiple funnels.

Memory Footprint

Memory Footprint mk Mon, 05/04/2020 - 14:10

MonetDB requires all data that needs to be active at any given point in time to fit into the address space --- and of course to fit on the storage device, (i.e., your disk system). On 32-bit systems, the address space is at most 32-bit (4 GB); in practice, it is actually limited to 3 GB or even 2 GB on most systems. On 64-bit systems, the address space can theoretically be 64-bit, but in practice is often "limited" to 48-bit or so --- not that that makes any difference ...

MonetDB excessively uses main memory for processing, but does not require that all data fit in the available physical memory. To handle a dataset that exceeds the available physical memory, MonetDB does not (only) rely on the available swap space, but (also) uses memory-mapped files to exploit disk storage beyond the swap space as virtual memory.

For example, while bulk-loading data (preferably using a COPY INTO statement from a (possibly compressed) CSV file), MonetDB needs to have all columns of the table that is currently being loaded "active", i.e., accessible in the address space. However, during loading, parts of the data are continuously written to the persistent files on disk, i.e., the whole table does not have to fit into main memory. E.g., loading a 100 GB table works fine on a system with 8 GB RAM and 16 GB swap -- provided there is sufficient free disk space.

During query processing, MonetDB requires for each single MAL operation during the query execution that all its inputs, its outputs, and possible temporary data structures fit in the address space. This won't be a problem on 64-bit machines, but may be experienced if you are close to the limit on 32-bit machines. MonetDB automatically resorts to virtual memory and memory-mapped files for large intermediate results. Also (large) persistent tables are accessed using memory mapping.

While running, you might see your mserver5 process' virtual size grow well beyond the available physical memory and possibly also well beyond your swap space.  In principle, this is not a problem at all.  Most of this virtual size is due to the fact that large base tables (or intermediate results) that reside as files on disk are memory-mapped into the address space. Those parts of the data that are currently not accessed do not consume any physical memory (except possibly for caching purposes).

However, if individual columns of your table(s) and/or individual columns of intermediate results exceed the size of the available physical memory, the performance of MonetDB might (will) decrease due to increased I/O requirements.

Multicore Usage

Multicore Usage mk Mon, 05/04/2020 - 14:12

Modern systems come with many CPUs on a single die and further augmented with hyperthreading to increase parallel processing. MonetDB will use as much of the cores as possible for both inter- and intra- parallelism.

A command line argument gdk_nr_threads=<number> can be set to  experiment with a limited number of cores assigned to each user session.

OOM Killer

OOM Killer mk Mon, 05/04/2020 - 14:46

The MonetDB server may become a victim of Linux kernel functionality  called Out Of Memory Killer (or OOM Killer) responsible for dealing with excessive memory requirements.
If the system reaches a point where it may  run out of memory, OOM Killer looks for victim process and ends its life the hard way.

In most cases, a simple restart of the server will suffice, but when the killer enters during a complex update transaction then the database may end-up in an inconsistent state.