In a previous blogpost we introduced Python UDfs (aka MonetDB/Python). With MonetDB/Python users can perform analytics inside the database, mitigating the costs of data transference.
While UDFs can perform highly efficient data analysis, debugging them has been a long-standing problem. Because the UDFs are executed within the database server, developers cannot use sophisticated debugging techniques (e.g., Interactive Debugging). Instead, they have to resort to inefficient debugging strategies (e.g., Print/Log-file debugging) in an effort to make the code work.
In this blogpost, we show a solution to this problem. From within a pymonetdb (i.e., Python client for MonetDB) connection, we ask the user to provide us with a problematic SQL statement that they want to debug. We then analyze it, extract any required input data from the database server and transfer the input data and the source code of the UDF to the client. Finally, we execute the function in the client process exactly as it would have been executed in the database server. The developer can then easily use interactive debugging tools to aid him in creating and modifying the UDFs.
Loopback queries are supported in our extension only when using console debugging.
Before running MonetDB/Python you need to be sure to have the latest version of NumPy and pymonetdb installed. To do that be sure to issue the following commands. Installation of numpy and pymonetdb is straightforward using pip:
pip install numpy
pip install pymonetdb
You can then run MonetDB by starting the monetdbd daemon or mserver5 as long as you enable MonetDB/Python using the respective commands:
monetdb set embedpy=true pytest
mserver5 --set embedded_py=true
For our example assume a table with integers, like the following:
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (3), (6), (8), (10);
Assume you want to create a UDF that computes the mean absolute deviation of a given column. However, you accidentally introduce a logical error (line in red) by not taking the absolute value in consideration. Usually, you would need to insert print statements to track variable changes.
CREATE OR REPLACE FUNCTION mean_deviation(column INTEGER)
RETURNS DOUBLE LANGUAGE PYTHON
{
mean = 0.0
for i in range (0, len(column)): mean += column[i]
mean = mean / len(column)
distance = 0.0
for i in range (0, len(column)): distance += column[i] - mean
deviation = distance/len(column)
return deviation;
};
If you wish to debug your function using terminal debugging (i.e., pdb) you can easily call the debug function as presented below:
import pymonetdb
conn = pymonetdb.connect(database= 'demo' ) #Open Database connection
c = conn.cursor()
sql = 'select mean_deviation(i) from integers;'
c.debug(sql, 'mean_deviation' ) #Console Debugging
You start by opening a database connection using pymonetdb.connect(), later you get a cursor object from the connection and through the cursor object you call the debug() function, sending as parameters the SQL you want to examine and the UDF name you wish to debug.
There is an optional sampling step that only transfers a uniform random sample of the data instead of the full input
data set. If you wish to sample you just need to send the number of elements you wish to get from the sampling
(e.g., c.debug(sql, 'mean_deviation', 10)
in case you desire the subset of 10 elements)
If you prefer to keep the data/function locally in order to use IDEs you can use the export() function. It can be used by calling the export() function instead of debug() as demonstrated below:
import pymonetdb
conn = pymonetdb.connect(database= 'demo' ) #Open Database connection
c = conn.cursor()
sql = 'select mean_deviation(i) from integers;'
c.export(sql, 'mean_deviation' ) #IDE Debugging
The export function also has the sampling option but also has a filespath option where you can define the location
where the files will be saved (e.g., c.export(sql, 'mean_deviation',10,path/to-save-files/)
). The default is the
current folder.
Below is depicted how the exported function looks like. Notice that some code is automatically generated /replaced in order to make sure that the function is ready to be executed in your IDE of choice. No changes being needed.
import cPickle
def mean_deviation(column):
mean = 0.0
for i in range (0, len(column)): mean += column[i]
mean = mean / len(column)
distance = 0.0
for i in range (0, len(column)): distance += column[i] - mean
deviation = distance/len(column)
return deviation;
input_parameters = cPickle.load(open( './input_data.bin' , 'rb' ))
mean_deviation(input_parameters[ 'column' ])
Pedro Holanda is a PhD student working at CWI. If you have any questions regarding this blogpost, you can contact him at holanda@cwi.nl.
For implementation details you can check the paper here.