Hi,
The project I'm involved in requires us to integrate MonetDB with "big data technologies", so exporting data from MonetDB to Parquet for further processing in the big data tools would be an obvious choice.
Can anyone guide me on what it would take to extend MonetDB with a writer for Parquet? It would be awesome if COPY INTO could produce Parquet efficiently, e.g. using the Arrow library: https://arrow.apache.org/docs/cpp/parquet.html
Kind regards, Daniel
Maybe the easiest would be to use the COPY INTO ON CLIENT interface and do all the encoding on the client side. In mclient you can use something like COPY (query) INTO 'some file' ON CLIENT;
What this does is send the data to the client, which is then responsible for writing the file. The client gets the name of the file from the server.
This is currently only implemented for mclient, using the mapi.c library, but it could also be implemented in other front ends. And of course, there is nothing preventing the front end from formatting the data as something other than CSV.
The client tells the server during the initial connect that it is capable of doing this, and when the time comes, the server tells the client that it needs to write a file and gives it the file name (the string from the COPY command) and the data.
On 11/07/2020 15.26, Daniel Glöckner wrote:
Hi,
the project which I'm involved in would require us to integrate MonetDB with "big data technologies" so exporting data from MonetDB to Parquet for further processing in the big data tools would be an obvious choice.
Can anyone guide me what it would mean to extend MonetDB with a writer for Parquet? It would be awesome if COPY INTO would be able to produce Parquet efficiently, e.g. using Arrow library: https://arrow.apache.org/docs/cpp/parquet.html
Kind regards, Daniel
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- Sjoerd Mullender
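To make the client-side idea above more concrete, here is a minimal sketch of a front end that receives the exported data in chunks and pivots the rows into per-column buffers, which is the shape a columnar writer such as Parquet would want. Everything here is an assumption for illustration: `ColumnarSink` and its `put` method are hypothetical names and not part of mapi.c, the data is assumed to arrive as text with `|` between fields and a newline after each row, and quoting and escaping are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical client-side sink; NOT part of mapi.c or any MonetDB API.
// It collects row-oriented text and stores it column by column.
class ColumnarSink {
public:
    explicit ColumnarSink(char field_sep = '|') : sep_(field_sep) {}

    // Called once per chunk of data handed to the client.
    // A chunk may end in the middle of a row, so incomplete rows are buffered.
    void put(const char *data, std::size_t length) {
        assert(data != nullptr || length == 0);
        if (length > 0)
            pending_.append(data, length);
        std::size_t start = 0;
        std::size_t nl;
        while ((nl = pending_.find('\n', start)) != std::string::npos) {
            add_row(pending_.substr(start, nl - start));
            start = nl + 1;
        }
        pending_.erase(0, start);  // keep only the unfinished tail
    }

    // columns()[c][r] is the text of column c in row r.
    const std::vector<std::vector<std::string>> &columns() const {
        return columns_;
    }

private:
    // Split one complete row on the field separator (no quote handling).
    void add_row(const std::string &row) {
        std::size_t col = 0;
        std::size_t start = 0;
        for (;;) {
            const std::size_t end = row.find(sep_, start);
            if (col == columns_.size())
                columns_.emplace_back();
            const std::size_t len =
                (end == std::string::npos) ? std::string::npos : end - start;
            columns_[col++].push_back(row.substr(start, len));
            if (end == std::string::npos)
                break;
            start = end + 1;
        }
    }

    char sep_;
    std::string pending_;
    std::vector<std::vector<std::string>> columns_;
};
```

A real front end would convert each column to its proper type and hand the buffers to a Parquet writer instead of keeping strings in memory; the sketch only shows the row-to-column step.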
I've done that using a MonetDB C UDF. It takes a MonetDB table (either temporary
or non-temporary) and outputs Parquet.
Essentially, the MonetDB C UDF passes the table's on-heap column pointers to a C++
function, and the C++ function writes Parquet using the Arrow C++ API.
/* on-heap column pointers */
typedef struct _monetdata {
    // C: create pointers to actual data on heap
    // bool
    char **boolheap;
    // short
    short **shortheap;
    // int32
    int **ptrheap;
    // float
    float **fltheap;
    // double
    double **dblheap;
    // int64
    long long **biheap;
    // int32-date
    int **dateheap;
    // char
    int *width_str_offset;
    unsigned char **cheap;
    unsigned short **usheap;
    unsigned int **uiheap;
    size_t **stheap;
    char **tvbase;
} t_monetdata;
/* C++: loop over rows for each Monet on-heap column, for example CHAR type */
case 6: // Monet char -> Parquet char
    vlj = len[icol];
    ba_writer =
        static_cast<parquet::ByteArrayWriter*>(rg_writer->NextColumn());
    for (i = minrow; i < maxrow; i++) {
        // pick the offset array that matches this column's offset width
        if (*(width_str_offset+char_count) == 1)
            cptr = tvbase[char_count] + cheap[char_count][i];
        else if (*(width_str_offset+char_count) == 2)
            cptr = tvbase[char_count] + usheap[char_count][i];
        else if (*(width_str_offset+char_count) == 4)
            cptr = tvbase[char_count] + uiheap[char_count][i];
        else
            cptr = tvbase[char_count] + stheap[char_count][i];
        // first byte -128 marks a NULL string: write it with definition level 0
        if (*cptr == -128){
            definition_level = 0;
            ba_writer->WriteBatch(1, &definition_level, nullptr,
                                  nullptr);
        }
        else {
            for (k=0; k
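Since the snippet above is cut off, here is a small self-contained sketch of just the string lookup it performs: choosing the offset array by width, adding the offset to the heap base, and treating a first byte of -128 (0x80) as NULL. It uses only the standard library and does not call Arrow. `StringColumnView`, `Cell`, `string_at`, `is_null_string` and `read_rows` are hypothetical names introduced for illustration, and the layout is assumed to be the one implied by the fields `tvbase`, `cheap`, `usheap`, `uiheap`, `stheap` and `width_str_offset`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view of one string column (names are illustrative only).
struct StringColumnView {
    const char *heap_base;  // role of tvbase[col]: start of the string heap
    const void *offsets;    // role of cheap/usheap/uiheap/stheap[col]
    int offset_width;       // role of width_str_offset[col]: 1, 2, 4 or sizeof(size_t)
};

// One decoded value with a Parquet-style definition level (0 = NULL, 1 = present).
struct Cell {
    std::int16_t definition_level;
    std::string value;
};

// Return a pointer to the NUL-terminated string stored for `row`.
inline const char *string_at(const StringColumnView &col, std::size_t row) {
    std::size_t off = 0;
    switch (col.offset_width) {
    case 1: off = static_cast<const std::uint8_t *>(col.offsets)[row]; break;
    case 2: off = static_cast<const std::uint16_t *>(col.offsets)[row]; break;
    case 4: off = static_cast<const std::uint32_t *>(col.offsets)[row]; break;
    default:
        assert(col.offset_width == static_cast<int>(sizeof(std::size_t)));
        off = static_cast<const std::size_t *>(col.offsets)[row];
        break;
    }
    return col.heap_base + off;
}

// Same check as `*cptr == -128`, written so it does not depend on
// whether plain char is signed on the target platform.
inline bool is_null_string(const char *s) {
    return static_cast<unsigned char>(*s) == 0x80;
}

// Decode rows [minrow, maxrow) into values plus definition levels.
inline std::vector<Cell> read_rows(const StringColumnView &col,
                                   std::size_t minrow, std::size_t maxrow) {
    assert(minrow <= maxrow);
    std::vector<Cell> out;
    out.reserve(maxrow - minrow);
    for (std::size_t i = minrow; i < maxrow; i++) {
        const char *s = string_at(col, i);
        if (is_null_string(s))
            out.push_back(Cell{0, std::string()});
        else
            out.push_back(Cell{1, std::string(s)});
    }
    return out;
}
```

In the real UDF each non-NULL value would go to the Parquet column writer instead of a `std::vector`; copying into `Cell` here only keeps the sketch independent of any library.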
participants (3)
- Anton Kravchenko
- Daniel Glöckner
- Sjoerd Mullender