
Hello Alberto,

I am looking into this issue and I would like to give you a small update and ask a couple of questions. Some background first: JSON objects are stored in MonetDB as strings. This creates overhead from the object keys and from numeric values that are now stored as strings.

If I understand your issue correctly you don't have the chance to commit the reconstructed JSON column. Is this correct? From some tests I performed, I noticed that even before committing, disk usage goes up slowly. This is probably unrelated to JSON itself, although I have not had a chance to verify this yet.

So I think the most relevant question at this stage is: how long are the keys in the JSON objects? I think that these are the most likely cause of the large overhead you are observing.

Best regards, Panos.

On 2022-09-28 21:17, Alberto Ferrari wrote:
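To make the overhead concrete, here is a rough back-of-the-envelope sketch in Python (illustrative only, not MonetDB internals; the field names and values are made up) comparing a row serialized as a JSON string with the same values stored as native numeric columns:

```python
import json
import struct

# Hypothetical row for illustration: a string-typed JSON column repeats the
# object keys in every row and stores the numbers as text.
row = {"sensor_id": 123456, "temperature": 21.75, "humidity": 48.2}

as_string = json.dumps(row)      # what a string-based JSON column would hold
native = struct.calcsize("qdd")  # one 64-bit int plus two doubles, stored natively

print(len(as_string.encode("utf-8")), "bytes as a JSON string")  # ~60 bytes, mostly keys
print(native, "bytes as native numeric columns")                 # 24 bytes
```

Most of the per-row cost in the string form comes from the repeated keys, which is why the key length matters.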

Hi Panos!

I'm not sure I fully understand your question "*If I understand your issue correctly you don't have the chance to commit the reconstructed JSON column*". We are working with auto-commit, if that is the question, but I don't think the commit would be an issue here.

The problem I saw in the past with JSON fields and space usage is that, as you mentioned, JSON fields are treated as strings. On the other hand, as far as I know (let me know if I'm wrong), MonetDB stores each column on disk based on the DISTINCT values it finds for that column. So, if for whatever reason the JSON keys are saved in a different order, or some keys are missing in some rows, that will result in different string values, which ends up storing "all the rows" on disk because of the many distinct values the table may contain.

BR Ariel

On Thu, Oct 6, 2022 at 1:00 PM Panagiotis Koutsourakis <panagiotis.koutsourakis@monetdbsolutions.com> wrote:
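Ariel's point about key order can be illustrated with a small Python sketch (illustrative only; the normalization shown is a possible client-side workaround, not something MonetDB does): two logically identical objects serialize to different strings, and any string-based store will see them as two distinct values.

```python
import json

# Two logically identical objects, serialized with different key order.
a = '{"id": 1, "status": "ok"}'
b = '{"status": "ok", "id": 1}'

print(json.loads(a) == json.loads(b))  # True: same JSON value
print(a == b)                          # False: two distinct strings

# A possible client-side workaround: normalize before inserting
# (sorted keys, no extra whitespace) so equal objects map to equal strings.
def normalize(s: str) -> str:
    return json.dumps(json.loads(s), sort_keys=True, separators=(",", ":"))

print(normalize(a) == normalize(b))    # True
```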

On 06/10/2022 18.38, Ariel Abadi wrote:
This is only true up to a point. If the combined size of the string values (the strings plus some overhead) is less than 64k, only distinct strings are saved. But if the combined string values grow beyond that, we use a different technique. There is a 1024-entry hash table where each entry points to the last string that was entered that hashes to that bucket. If a new string is added that hashes to a bucket pointing to an identical string, no new value is added; but if the string that the bucket points to is different, the new string is added and the bucket is overwritten to point to the new string. This means that if you have two different strings that hash to the same bucket and you add them in alternating fashion, your string storage will keep growing (if you're past the 64k boundary).
-- Sjoerd Mullender
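Here is a toy Python model of the heuristic Sjoerd describes (an illustrative sketch only, not MonetDB code; the hash function is a stand-in). It shows how two strings that collide in the same bucket, inserted in alternating fashion, keep appending new copies once the past-64k scheme is in effect:

```python
import hashlib

NBUCKETS = 1024  # matches the 1024-entry table described above

def bucket_of(s: str) -> int:
    # Stand-in hash for illustration; MonetDB uses its own internal string hash.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % NBUCKETS

class StringHeapModel:
    """Toy model of the past-64k deduplication heuristic (not MonetDB code)."""

    def __init__(self):
        self.heap = []                    # stand-in for the string heap: one entry per stored copy
        self.buckets = [None] * NBUCKETS  # bucket -> index of the last string stored for it

    def add(self, s: str) -> int:
        h = bucket_of(s)
        idx = self.buckets[h]
        if idx is not None and self.heap[idx] == s:
            return idx                    # bucket points at an identical string: reuse it
        self.heap.append(s)               # otherwise store a new copy ...
        self.buckets[h] = len(self.heap) - 1
        return self.buckets[h]            # ... and repoint the bucket at it

# Pathological case: two different strings that happen to land in the same
# bucket, inserted in alternating fashion.
a = "key_0"
b = next(f"key_{i}" for i in range(1, 10**6)
         if bucket_of(f"key_{i}") == bucket_of(a))
m = StringHeapModel()
for _ in range(100):
    m.add(a)
    m.add(b)
print(len(m.heap))  # 200 stored copies for only 2 distinct strings
```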

Hi Ariel,

Let me rephrase the question: does the process that creates the JSON column finish successfully, or does the disk get full before it finishes? The second question is: how many bytes long are the keys in each JSON object? I am trying to understand whether the disk usage you see is normal operation and what, if anything, can be optimized.

Best regards, Panos.

On 2022-10-06 19:38, Ariel Abadi wrote:
participants (4)
- Alberto Ferrari
- Ariel Abadi
- Panagiotis Koutsourakis
- Sjoerd Mullender