Dear all,

I'm new to MonetDB database programming and have a problem setting up my first database which I cannot explain myself. I'm basically running the following Python code:

import os
import time
import monetdb.sql

tempdir = 'E:\\xetra\\tempdir\\monetdb'
connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="database")

# expected table layout
connection.execute("CREATE TABLE XXX(Name_1 TEXT, Name_2 TEXT, Name_3 TEXT, Name_4 TEXT, Num_1 INT, Num_2 REAL, Num_3 INT, Name_5 CHARACTER(1))")

def convert_table(filename, path):
    print("Reading CSV")
    try:
        start = time.clock()
        csv_to_sql = path + "\\" + filename
        connection.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + csv_to_sql + "' USING DELIMITERS ';','\n';")
        elapsed = time.clock() - start
        connection.commit()
        print("Time to write to database: " + str(elapsed) + " seconds")
    except:
        # roll the failed transaction back, then retry after
        # converting the file to UTF-8
        connection.rollback()
        try:
            # encoding conversion cp1252 -> UTF-8
            start = time.clock()
            infile = open(csv_to_sql, 'rb')
            csv_to_sql_converted = path + '\\converted.csv'
            outfile = open(csv_to_sql_converted, 'wb')
            BLOCKSIZE = 100000000  # experiment with size
            while True:
                block = infile.read(BLOCKSIZE)
                if not block:
                    break
                # cp1252 is a single-byte encoding, so decoding per block is safe
                outfile.write(block.decode('cp1252', 'strict').encode('utf-8', 'strict'))
            infile.close()
            outfile.close()
            elapsed = time.clock() - start
            print("Time to convert file to UTF-8: " + str(elapsed) + " seconds")
            # write to DB
            start = time.clock()
            connection.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + csv_to_sql_converted + "' USING DELIMITERS ';','\n';")
            connection.commit()
            os.remove(csv_to_sql_converted)
            elapsed = time.clock() - start
            print("Time to write to database: " + str(elapsed) + " seconds")
        except Exception as inst:
            # get the failed transaction out of the DB with a rollback
            connection.rollback()
            print(type(inst))  # the exception instance
            print(inst.args)   # arguments stored in .args
            print(inst)        # __str__ allows args to be printed directly
            print("Could not write file " + csv_to_sql + " to the database XXX")

filename_list = ["XYZ.csv", "ABC.csv", ...]  # actual file names elided
path = "C:\\"
for filename in filename_list:
    convert_table(filename, path)

I had to alter the code because the actual CSVs contain very sensitive data which I am not allowed to publish, but the structure is the same.

I do not get any error message and the program runs through properly; usually it goes through the "try:" part of the function. When I look into the resulting database, the number of lines is as expected (so the observations must be somewhere in the database), but there are gaps in the output: some values from the CSVs are simply not copied, and I cannot compare those gap lines to the original data because the identifier is usually among the missing variables. As far as I can tell, this happens in an unsystematic way. I have the impression that MonetDB might just be overwhelmed by the amount of data. Is this possible? The CSVs are each about 1.3 GB in size, and I am reading more than 1000 of them into the database, which reaches a final size of about 2 TB. I am running this under Windows Server 2008 R2, 32 GB RAM, AMD Opteron 4234 6-Core 3.10 GHz, using Anaconda.

When I manually read in a single file that had caused problems beforehand, it ends up in the database without error. So I am pretty puzzled about what is going on here. Can it happen that the connection.execute command runs through without triggering an error message, yet still does not read in the whole dataset?
I would be very glad about any help from the community.

Thanks,
Thomas
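One way to check that last question directly is to compare the table's row count against each file's line count after every load. A minimal sketch, assuming the python-monetdb cursor API and the table layout from the script above (the helper names and the sample path are illustrative):

import monetdb.sql

def count_lines(path):
    # count data records in the CSV; subtract 1 for the header line that OFFSET 2 skips
    with open(path, 'rb') as f:
        return sum(1 for _ in f) - 1

def count_rows(connection):
    # ask the server how many rows the table holds right now
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM XXX")
    return cursor.fetchone()[0]

connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="database")
csv_to_sql = "C:\\XYZ.csv"  # one of the files loaded above

before = count_rows(connection)
connection.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + csv_to_sql + "' USING DELIMITERS ';','\n';")
connection.commit()
loaded = count_rows(connection) - before

if loaded != count_lines(csv_to_sql):
    print("Mismatch: file has " + str(count_lines(csv_to_sql)) + " records, COPY added " + str(loaded))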
Thomas,

can you try to first convert all cp1252 input files to UTF-8 files, and then load the latter via mclient (e.g., by putting all COPY INTO statements into a single SQL file)? Eliminating the Python interface would help to check whether the problem originates in the server or rather in the Python interface.

Best,
Stefan
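A minimal sketch of that approach, assuming mclient is on the PATH (the script file name is illustrative; the table, paths, and database name are taken from the original mail):

# write one SQL script containing all COPY INTO statements,
# then run it in a single mclient session, bypassing Python entirely
filename_list = ["XYZ.csv", "ABC.csv"]  # the UTF-8-converted files
path = "C:\\"

with open("load_all.sql", "w") as script:
    for filename in filename_list:
        script.write("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '"
                     + path + filename + "' USING DELIMITERS ';','\\n';\n")

# then, from the command line:
#   mclient -d database load_all.sql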
--
| Stefan.Manegold@CWI.nl | DB Architectures (DA)   |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam (NL)  |
Dear Stefan,

First of all, thanks for your quick response. I tried your idea on a subset of our data and it worked. However, as I tried to explain in my earlier mail, it generally works in Python too: if I repeatedly read in the same CSV again and again, I get the desired output in the database in 90% of the cases, but sometimes it just doesn't copy all the information from the CSV. If the Python interface were at fault, I would expect it either to work or not to work.

Best,
Thomas
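One way to narrow this down further would be to load the same file repeatedly into an emptied table and record the counts on each run; comparing COUNT(*) with COUNT(column) also exposes the "gap" rows, since COUNT(column) skips NULLs. A rough sketch along those lines, reusing the connection, table, and COPY statement from the first mail (the file name is a placeholder):

import monetdb.sql

connection = monetdb.sql.connect(username="monetdb", password="monetdb", hostname="localhost", database="database")
cursor = connection.cursor()
test_file = "C:\\XYZ.csv"  # a file that previously showed gaps

for run in range(10):
    cursor.execute("DELETE FROM XXX")  # start every run from an empty table
    cursor.execute("COPY 100000000 OFFSET 2 RECORDS INTO XXX FROM '" + test_file + "' USING DELIMITERS ';','\n';")
    connection.commit()
    cursor.execute("SELECT COUNT(*), COUNT(Name_1) FROM XXX")
    total, non_null = cursor.fetchone()
    print("run " + str(run) + ": " + str(total) + " rows, " + str(non_null) + " with Name_1 set")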