User/Alvonruff/2to3

From ISFDB
Jump to navigation Jump to search

Python3 Conversion

  • Use 2to3 in a manner the same as futurize. This produces results suitable for python3.
  • Update to the official MySQL mysql.connector. There is no python3 support for MySQLdb, which is our current connector. Moving to a new connector is a prerequisite.
  • Fix the following issues:

DATE Field With Month or Day of Zero

Probably the most significant issue found so far is that the mysql.connector does not support datetime values where either the month or day are zero. As stated in the MySQL documentation:

   “Zero” date or time values used through Connector/ODBC are converted automatically to NULL because ODBC cannot handle such values.

This is a fundamental change to Python3 (See https://docs.python.org/3/library/datetime.html):

   class datetime.date(year, month, day)
       All arguments are required. Arguments must be integers, in the following ranges:
       MINYEAR <= year <= MAXYEAR
       1 <= month <= 12
       1 <= day <= number of days in the given month and year
       datetime.MINYEAR - The smallest year number allowed in a date or datetime object. MINYEAR is 1.
       datetime.MAXYEAR - The largest year number allowed in a date or datetime object. MAXYEAR is 9999.

So the following date constructions are no longer allowed in a datetime variable:

  • 0000-00-00
  • 1984-00-00
  • 1984-10-00
  • 8888-00-00
  • 9999-00-00

This means that many dates in the ISFDB are returned as NULL by mysql.connector, which is converted to None in Python, which are then displayed as 'unknown' by the ISFDB. The least invasive workaround would be to use DATE_FORMAT to convert the datetime field before it ever leaves MySQL:

    select *, DATE_FORMAT(title_copyright, '%Y-%m-%d') as title_date  from titles where title_id=92392;

This has the side effect of adding the stringified date as a new field appended to the end of the record. There would then need to be code to patch up the NULL title_copyright field:

   for record in records:
       index = len(record)-1
       record[TITLE_YEAR] = record[index]

This will make the TITLE_YEAR field a string instead of a datetime. Note that this requires the records array to be constructed with the list() operator:

       while record:
               records.append(list(record[0]))
               record = cursor.fetchmany()

This issue is the most time-consuming part of the port to Python3. It's unclear what the best procedure is for moving forward:

  • On the one hand, our python scripts currently treat dates as though they are strings, so why not just make them strings in MySQL?
  • On the other hand, we have SQL statements that select on YEAR(), MONTH(), DAYOFMONTH(), DAY(), etc.. which all require the column to be of type DATE.
  • On the third hand, there are about 1800 unique select statements, with an unknown number that would require modification to stringify the DATE field. Some of these are not straightforward to change (looking at you SQLloadAllTitleReviews() with your four variant clauses with field renaming like 'review.title_copyright as review_date')
  • On the fourth hand, I haven't gotten to the editing portion of the program, so I'm not sure how well it's going to work to store an invalid date through the connector into a DATE field. UPDATE: writing a string value into a DATE field through the connector works fine, so not an issue.

The Current Solution

This is a converter problem. The data is represented fine in MySQL, but the mysql.connector converter wants to map DATE fields onto datetime, and datetime doesn't allow month/days of zero. While it is theoretically possible to write a custom converter, I'm currently going to use raw mode. So the first step is to use the raw argument in the cursor() init routine:

   self.cursor = db.cursor(buffered=True, raw=True)

This puts the cursor into raw mode, which will return each field in the record as a tuple of bytearray entries:

   ( bytearray(b'1026720973'), bytearray(b'Ancient Greek'),)

These can be converted to a string with the str() operator, and that will give us the DATE fields as strings. The drawback is that other fields that are normally INT types are also converted to strings. This can be addressed by checking isdigit() with a followup int() operator. That will convert anything that looks like an integer into an int type. That workaround has two drawbacks on some publication fields:

  • Our ISBN storage compresses ISBNs into strings that often looks like integers, so it's type becomes an int instead of a string, and
  • We store the page count in a string field to handle constructs like "xvii+526". However, when the page count is just "526" the new algorithm promotes it to an int.

These two issues are easy to find and repair, so I'm going with this for now. The algorithm to fetch a record then looks like this:

   record = list()
   fetched = self.cursor.fetchone()
   if (fetched is not None) and (len(fetched) > 0):
       for item in fetched:
           if item == None:
               record.append(item)
           else:
               stritem = str(item, 'latin1')
               if stritem.isdigit():
                   record.append(int(stritem))
               else:
                   record.append(stritem)
       record = tuple([tuple(record)])
   else:
       record = tuple(record)
   return record

MD5 is Not a Distinct Module Under Python3

Use of md5 needs to be replaced with hashlib:

   import hashlib
   result = hashlib.md5(something)

String Methods Cannot Be Accessed via 'string'

Python3 doesn't allow the following syntax:

   string.replace()

It instead should be:

   str.replace()
   str.strip()
   str.rstrip()

Surprise - Division Works Differently on Python3

The following code generates different results on Python2 vs Python3:

   checksum = 112
   remainder = (checksum/10)*10

On python2 it returns the integer 110, while on python3 it returns the float 112.0. Under python3 the division operator always returns a float. This obviously breaks the ISBN checksum calculator. The corrected code would be:

   checksum = 112
   remainder = int(checksum/10)*10

Use of encode() to enforce ISO-8859-1

Under Python2, issuing an encode() as follows:

   value = document[0].firstChild.data.encode('iso-8859-1')

Would create a str object with iso-8859-1 character set encoding. Under Python3 it creates a bytes object with iso-8859-1 character set encoding, and requires a followup decode() to move it back to a str type.

Data coming in from MySQL (and stdin) is already encoded as iso-8859-1, so the encode() statements only changes the str to the bytes type, which doesn't have any str operators, like find, replace, or strip.

See the XML Unicode Debugging section of https://isfdb.org/wiki/index.php?title=User:Alvonruff/Debugging_Remarks for more extensive details.

FieldStorage Needs an Encoding String

Running an editing script (say editauth) which contains latin-1 characters in the form, will cause an exception on the submission script. The documentation says that the default character set for FieldStorage is ASCII, but I have not confirmed that. It should honor an accept-charset argument in the form:

   <form id="data" METHOD="POST" ACTION="/cgi-bin/edit/submitauth.cgi" accept-charset="ISO-8859-1">

But that didn't do anything. Adding an encoding to FieldStorage() does work:

   self.form = cgi.FieldStorage(encoding="iso-8859-1")

It's the same number of edits either way.

has_key

We have a ton of statements in the submission scripts along these lines:

   if self.form.has_key('title_content'):

These generate the following error:

   AttributeError: has_key
     args = ('has_key',)
     with_traceback = <built-in method with_traceback of AttributeError object>

If run through the 2to3 translator, it translates those statements into:

   if 'title_content' in self.form:

which seem to work fine.