I've done that using a MonetDB C UDF.  It takes a MonetDB table (either temporary or regular) and outputs Parquet.
Essentially, the C UDF passes the table's on-heap column pointers to a C++ function, and the C++ function writes Parquet using the Arrow C++ API.
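
Roughly, the glue between the two looks like this (a sketch only; the
function name and parameter list are placeholders, not the real signature):

/* sketch: the C UDF fills a t_monetdata (defined below) with per-column
   heap pointers and hands it to the C++ writer */
int monet_to_parquet(t_monetdata *md, int ncols,
                     size_t minrow, size_t maxrow, const char *path);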

/* On-heap column pointers: one array of per-column data pointers per MonetDB type */
typedef struct _monetdata {
  char **boolheap;           /* bool columns */
  short **shortheap;         /* smallint columns */
  int **ptrheap;             /* int32 columns */
  float **fltheap;           /* float columns */
  double **dblheap;          /* double columns */
  long long **biheap;        /* int64 (bigint) columns */
  int **dateheap;            /* date columns (int32-encoded) */
  /* char/varchar columns: each value is an offset into the column's string
     heap; the offset width (1, 2, 4, or 8 bytes) depends on the heap size */
  int *width_str_offset;     /* per-column offset width in bytes */
  unsigned char **cheap;     /* 1-byte offsets */
  unsigned short **usheap;   /* 2-byte offsets */
  unsigned int **uiheap;     /* 4-byte offsets */
  size_t **stheap;           /* 8-byte offsets */
  char **tvbase;             /* base address of each column's string heap */
} t_monetdata;
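
For context, the rg_writer used below comes from the low-level Arrow/Parquet
C++ writer API.  A minimal setup sketch (file name and the single schema
field are placeholders; error handling elided):

#include <arrow/io/file.h>
#include <parquet/api/writer.h>

// Open the output file and build a schema (one OPTIONAL UTF8 column here
// as a placeholder), then start a row group.
std::shared_ptr<arrow::io::FileOutputStream> out_file;
PARQUET_ASSIGN_OR_THROW(out_file,
    arrow::io::FileOutputStream::Open("/tmp/out.parquet"));

parquet::schema::NodeVector fields;
fields.push_back(parquet::schema::PrimitiveNode::Make(
    "s", parquet::Repetition::OPTIONAL,
    parquet::Type::BYTE_ARRAY, parquet::ConvertedType::UTF8));
auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make(
        "schema", parquet::Repetition::REQUIRED, fields));

std::shared_ptr<parquet::ParquetFileWriter> file_writer =
    parquet::ParquetFileWriter::Open(out_file, schema);

parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup();
// ... per-column WriteBatch loops, e.g. the CHAR case below ...
file_writer->Close();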


/* C++ side: per-row loop for each MonetDB on-heap column; the CHAR case
   from the type switch, for example: */
          case 6:  // MonetDB char -> Parquet BYTE_ARRAY
            vlj = len[icol];  // declared width of this column
            ba_writer = static_cast<parquet::ByteArrayWriter*>(rg_writer->NextColumn());
            for (i = minrow; i < maxrow; i++) {
              // String values are offsets into the column's string heap;
              // the offset width (1, 2, 4, or 8 bytes) selects the array.
              if (width_str_offset[char_count] == 1)
                cptr = tvbase[char_count] + cheap[char_count][i];
              else if (width_str_offset[char_count] == 2)
                cptr = tvbase[char_count] + usheap[char_count][i];
              else if (width_str_offset[char_count] == 4)
                cptr = tvbase[char_count] + uiheap[char_count][i];
              else
                cptr = tvbase[char_count] + stheap[char_count][i];

              // MonetDB string NIL marker (0x80); '\200' stays correct
              // even on platforms where plain char is unsigned, -128 does not
              if (*cptr == '\200') {
                definition_level = 0;  // write a Parquet null
                ba_writer->WriteBatch(1, &definition_level, nullptr, nullptr);
              }
              else {
                // copy up to vlj bytes, stopping at the terminating NUL
                for (k = 0; k < vlj; k++) {
                  c = *cptr++;
                  if (c != 0)
                    charbuf[k] = c;
                  else
                    break;
                }
                definition_level = 1;
                ba_value.ptr = reinterpret_cast<const uint8_t*>(charbuf);
                ba_value.len = k;
                ba_writer->WriteBatch(1, &definition_level, nullptr, &ba_value);
              }
            }
            char_count++;
            break;
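
Numeric columns are simpler because the values are fixed width; for
comparison, an int32 column would look roughly like this (a sketch: the
case label, the int_count counter, and the NIL test against INT32_MIN,
i.e. GDK's int_nil, are my assumptions):

          case 2: {  // MonetDB int -> Parquet INT32
            parquet::Int32Writer* i32_writer =
                static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
            for (i = minrow; i < maxrow; i++) {
              int32_t v = ptrheap[int_count][i];
              if (v == INT32_MIN) {    // assumed NIL marker (GDK int_nil)
                definition_level = 0;  // Parquet null
                i32_writer->WriteBatch(1, &definition_level, nullptr, nullptr);
              } else {
                definition_level = 1;
                i32_writer->WriteBatch(1, &definition_level, nullptr, &v);
              }
            }
            int_count++;
            break;
          }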


Anton

On Mon, Jul 13, 2020 at 12:42 AM Sjoerd Mullender <sjoerd@monetdb.org> wrote:
Maybe the easiest approach would be to use the COPY INTO ON CLIENT
interface and do all the encoding on the client side.
In mclient you can use something like
COPY (query) INTO 'some file' ON CLIENT;

What this does is send the data to the client, which is then responsible
for writing the file.  The client gets the name of the file from the server.

This is currently only implemented for mclient, using the mapi.c library.
But it could also be implemented in other front ends.  And of course,
there is nothing preventing the front end from formatting the data as
something other than CSV.

The client tells the server during the initial connect that it is capable
of doing this, and when the time comes, the server tells the client that
it needs to write a file, giving it the file name (the string from the
COPY command) and the data.

On 11/07/2020 15.26, Daniel Glöckner wrote:
> Hi,
>
> the project I'm involved in would require us to integrate MonetDB with
> "big data technologies", so exporting data from MonetDB to Parquet for
> further processing in the big data tools would be an obvious choice.
>
> Can anyone guide me on what it would mean to extend MonetDB with a
> writer for Parquet?
> It would be awesome if COPY INTO were able to produce Parquet
> efficiently, e.g. using the Arrow
> library: https://arrow.apache.org/docs/cpp/parquet.html
>
> Kind regards,
> Daniel

--
Sjoerd Mullender