Dear all,

I'm new to MonetDB database programming and have a problem setting up my first database which I cannot explain myself. I'm basically running the following Python code:

import os
import time
import monetdb.sql

tempdir = 'E:\\xetra\\tempdir\\monetdb'
connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="database")

# expected table layout
connection.execute("CREATE TABLE XXX(Name_1 TEXT, Name_2 TEXT, Name_3 TEXT, Name_4 TEXT, Num_1 INT, Num_2 REAL, Num_3 INT, Name_5 CHARACTER(1))")

def convert_table(filename, path):
    print("Reading CSV")
    try:
        start = time.clock()
        csv_to_sql = path + "\\" + filename
        connection.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + csv_to_sql + "' USING DELIMITERS ';','\n';")
        elapsed = time.clock() - start
        connection.commit()
        print("Time to write to database: " + str(elapsed) + " seconds")
    except:
        # roll the failed transaction back, then retry after
        # converting the file to UTF-8
        connection.rollback()
        try:
            # encoding conversion cp1252 -> UTF-8
            start = time.clock()
            infile = open(csv_to_sql, 'rb')
            csv_to_sql_converted = path + '\\converted.csv'
            outfile = open(csv_to_sql_converted, 'wb')
            BLOCKSIZE = 100000000  # experiment with size
            while True:
                block = infile.read(BLOCKSIZE)
                if not block:
                    break
                # cp1252 is a single-byte encoding, so decoding per block is safe
                outfile.write(block.decode('cp1252', 'strict').encode('utf-8', 'strict'))
            infile.close()
            outfile.close()
            elapsed = time.clock() - start
            print("Time to convert file to UTF-8: " + str(elapsed) + " seconds")
            # write to DB
            start = time.clock()
            connection.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + csv_to_sql_converted + "' USING DELIMITERS ';','\n';")
            connection.commit()
            os.remove(csv_to_sql_converted)
            elapsed = time.clock() - start
            print("Time to write to database: " + str(elapsed) + " seconds")
        except Exception as inst:
            # get the failed transaction out of the DB with a rollback
            connection.rollback()
            print(type(inst))  # the exception instance
            print(inst.args)   # arguments stored in .args
            print(inst)        # __str__ allows args to be printed directly
            print("Could not write file " + csv_to_sql + " to the database XXX")

filename_list = ["XYZ.csv", "ABC.csv", ...]  # actual file names elided
path = "C:\\"
for filename in filename_list:
    convert_table(filename, path)

I had to alter the code because the actual CSVs contain very sensitive data which I am not allowed to publish, but the structure is the same.

I do not get any error message and the program runs through properly; usually it goes through the "try:" part of the function. When I look into the resulting database, the number of lines is as expected (so the observations must be somewhere in the database), but there are gaps in the output: some values from the CSVs are simply not copied, and I cannot compare those gap lines to the original data because the identifier is usually among the missing variables. As far as I can tell, this happens in an unsystematic way. I have the impression that MonetDB might just be overwhelmed by the amount of data. Is this possible? The CSVs are each about 1.3 GB in size, and I am reading more than 1000 of them into the database, which reaches a final size of about 2 TB. I am running this under Windows Server 2008 R2, 32 GB RAM, AMD Opteron 4234 6-Core 3.10 GHz, using Anaconda.

When I manually read in a single file that had caused problems beforehand, it ends up in the database without error. So I am pretty puzzled about what is going on here. Can it happen that the connection.execute command runs through without triggering an error message, yet still does not read in the whole dataset?
I would be very glad about any help from the community.

Thanks,
Thomas
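One way to check that last question directly is to compare the table's row count against each file's line count after every load. A minimal sketch, assuming the python-monetdb cursor API and the table layout from the script above (the helper names and the sample path are illustrative):

import monetdb.sql

def count_lines(path):
    # count data records in the CSV; subtract 1 for the header line that OFFSET 2 skips
    with open(path, 'rb') as f:
        return sum(1 for _ in f) - 1

def count_rows(connection):
    # ask the server how many rows the table holds right now
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM XXX")
    return cursor.fetchone()[0]

connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="database")
csv_to_sql = "C:\\XYZ.csv"  # one of the files loaded above

before = count_rows(connection)
connection.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + csv_to_sql + "' USING DELIMITERS ';','\n';")
connection.commit()
loaded = count_rows(connection) - before

if loaded != count_lines(csv_to_sql):
    print("Mismatch: file has " + str(count_lines(csv_to_sql)) + " records, COPY added " + str(loaded))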
Thomas,

can you try to first convert all cp1252 input files to UTF-8 files, and then load the latter via mclient (e.g., by putting all COPY INTO statements into a single SQL file)? Eliminating the Python interface would help to check whether the problem originates in the server or rather in the Python interface.

Best,
Stefan
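A minimal sketch of that approach, assuming mclient is on the PATH (the script file name is illustrative; the table, paths, and database name are taken from the original mail):

# write one SQL script containing all COPY INTO statements,
# then run it in a single mclient session, bypassing Python entirely
filename_list = ["XYZ.csv", "ABC.csv"]  # the UTF-8-converted files
path = "C:\\"

with open("load_all.sql", "w") as script:
    for filename in filename_list:
        script.write("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '"
                     + path + filename + "' USING DELIMITERS ';','\\n';\n")

# then, from the command line:
#   mclient -d database load_all.sql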
--
| Stefan.Manegold@CWI.nl | DB Architectures (DA)   |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam (NL)  |
Dear Stefan,

First of all, thanks for your quick response. I tried your idea on a subset of our data and it worked. However, as I tried to explain in my earlier mail, it generally works in Python too: if I repeatedly read in the same CSV again and again, I get the desired output in the database in 90% of the cases, but sometimes it just doesn't copy all the information from the CSV. If the Python interface were at fault, I would expect it either to work or not to work.

Best,
Thomas
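One way to narrow this down further would be to load the same file repeatedly into an emptied table and record the counts on each run; comparing COUNT(*) with COUNT(column) also exposes the "gap" rows, since COUNT(column) skips NULLs. A rough sketch along those lines, reusing the connection, table, and COPY statement from the first mail (the file name is a placeholder):

import monetdb.sql

connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="database")
cursor = connection.cursor()
test_file = "C:\\XYZ.csv"  # a file that previously showed gaps

for run in range(10):
    cursor.execute("DELETE FROM XXX")  # start every run from an empty table
    cursor.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + test_file + "' USING DELIMITERS ';','\n';")
    connection.commit()
    cursor.execute("SELECT COUNT(*), COUNT(Name_1) FROM XXX")
    total, non_null = cursor.fetchone()
    print("run " + str(run) + ": " + str(total) + " rows, " + str(non_null) + " with Name_1 set")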