ISFDB talk:Data Consistency/Serial Mismatches


Some of these don't look like problems. "New Destinies" IS (well, WAS) a magazine, but to allow for the multiple printings, we've had to class it as "Anthology". No consistency problem there, really. BLongley 15:50, 18 Sep 2007 (CDT)

Using "Serial" to group several pieces within the same book looks wrong, though - abuse "Series" for that, I'd say, unless they really are reprints of magazine entries. BLongley 15:50, 18 Sep 2007 (CDT)

"(Excerpt)" looks wrong too, unless they're numbered excerpts leading to a whole work. BLongley 15:50, 18 Sep 2007 (CDT)

"(Complete Novel)" is a known problem: I know that's what it says in many magazines, but reprints of the exact same work don't cut the mustard by today's standards. Create "Novella/Novelette" variants of the magazine entries before inclusion in works where they're considered Short-Fiction, maybe? Or make the "(Complete Novel)" titles variants of what we'd call them now? BLongley 15:50, 18 Sep 2007 (CDT)

I'm not sure where this is going, but I'd sort this lot by title. "Journal from Ellipsia" still confuses me with double entries here. BLongley 15:50, 18 Sep 2007 (CDT)

Posting the Python code here as per Bill's request:
# initialize the variables that will keep track of different type mismatches
bad_title = 0
bad_pub = 0
bad_xref = 0
mismatches = {}
exceptions = {}
# import the MySQLdb module that will allow Python to access MySQL
import MySQLdb
# establish a connection to MySQL and call it 'db'
db = MySQLdb.connect('localhost', username, password, 'ISFDB')
# create a cursor for the pub-content table
xref_cursor = db.cursor()
# create a cursor for the title table
title_cursor = db.cursor()
# create a cursor for the pubs table
pubs_cursor = db.cursor()
# initialize the list that will hold xref information
xref = []
# initialize the dictionary that will hold title information
titles = {}
# initialize the dictionary that will hold publication information
pubs = {}
# retrieve all rows in the pub_content (cross-reference) table
response = xref_cursor.execute("select * from pub_content;")
i = 0
while 1:
    #retrieve one row of pub_content data
    row = xref_cursor.fetchone()
    #break out of the loop when we finish the last retrieved row
    if row is None:
        break
    i = i + 1
    if i % 25000 == 0:
        print "Imported ",i," cross-reference records"
    #xref_id = int(row[0])
    if type(row[1]) is not long:
        bad_xref = bad_xref + 1
        #print type(title_id)
        continue
    title_id = int(row[1])
    if type(row[2]) is not long:
        bad_xref = bad_xref + 1
        #print pub_id
        continue
    pub_id = int(row[2])
    xref.append([title_id,pub_id])
print "Retrieved ",len(xref)," cross-reference records"

# now retrieve all rows in the titles table
response = title_cursor.execute("select title_id,title_ttype,title_title from titles;")
i = 0
while 1:
    #retrieve one row of titles data
    row = title_cursor.fetchone()
    #break out of the loop when we finish the last retrieved row
    if row is None:
        break
    i = i + 1
    if i % 25000 == 0:
        print "Imported ",i," title records"
    title_id = int(row[0])
    titles[title_id] = row[1]
        
# now retrieve all rows in the pubs table
response = pubs_cursor.execute("select pub_id,pub_ctype,pub_title from pubs;")
i = 0
while 1:
    #retrieve one row of pubs data
    row = pubs_cursor.fetchone()
    #break out of the loop when we finish the last retrieved row
    if row is None:
        break
    i = i + 1
    if i % 25000 == 0:
        print "Imported ",i," publications records"
    pub_id = int(row[0])
    pubs[pub_id] = row[1]

#iterate over the xref list and compare title/publication type
i = 0
for title_id, pub_id in xref:
    i = i + 1
    if i % 25000 == 0:
        print "Analyzed ",i," cross-reference records"
    title_type = titles.get(title_id , "bad key")
    pub_type = pubs.get(pub_id , "bad key")
    if (title_type == "bad key"):
        bad_xref = bad_xref + 1
        continue
    if (pub_type == "bad key"):
        bad_xref = bad_xref + 1
        continue
    if (title_type == "CHAPTERBOOK"):
        mismatches["chapterbooks"] = mismatches.get("chapterbooks" , 0) + 1
        continue
    if pub_type == title_type:
        continue
    if pub_type in (None, "None", ""):
        mismatches["missing publication type"] = mismatches.get("missing publication type" , 0) + 1
        continue
    if title_type in ("ESSAY", "REVIEW", "COVERART", "INTERIORART", "BACKCOVERART", "INTERVIEW"):
        continue
    if title_type == "OMNIBUS":
        if pub_type != "OMNIBUS":
            mismatches["omnibuses"] = mismatches.get("omnibuses" , 0) + 1
        continue
    if (title_type == "ANTHOLOGY") and (pub_type != "ANTHOLOGY"):
        mismatches["anthologies"] = mismatches.get("anthologies" , 0) + 1
        continue
    if (title_type == "EDITOR"):
        if (pub_type != "MAGAZINE") and (pub_type != "FANZINE") :
            mismatches["editor records"] = mismatches.get("editor records" , 0) + 1
        continue
    if (title_type == "SERIAL"):
        if (pub_type != "MAGAZINE") and (pub_type != "FANZINE") :
            mismatches["serials"] = mismatches.get("serials" , 0) + 1
            exceptions["serials", (title_id , pub_id)] = ""
        continue
    if (title_type == "COLLECTION") and (pub_type != "COLLECTION"):
        mismatches["collections"] = mismatches.get("collections" , 0) + 1
        continue
    if (title_type == "NOVEL"):
        if pub_type == "NOVEL":
            continue
        if pub_type == "OMNIBUS":
            continue
        if (pub_type == "MAGAZINE") or (pub_type == "FANZINE"):
            mismatches["novel/magazines"] = mismatches.get("novel/magazines" , 0) + 1
            continue
        mismatches["novels (other)"] = mismatches.get("novels (other)" , 0) + 1
        continue
    if (title_type == "SHORTFICTION"):
        if (pub_type == "NONFICTION") or (pub_type == "NOVEL"):
            mismatches["short fiction"] = mismatches.get("short fiction" , 0) + 1
        continue
    if (title_type == "POEM"):
        if (pub_type == "NONFICTION"):
            mismatches["poems"] = mismatches.get("poems" , 0) + 1
        continue
    if (title_type == "NONFICTION"):
        if (pub_type != "NONFICTION"):
            mismatches["nonfiction"] = mismatches.get("nonfiction" , 0) + 1
        continue
    if (title_type == "NONGENRE"):
        if (pub_type != "NOVEL"):
            mismatches["nongenre"] = mismatches.get("nongenre" , 0) + 1
        continue
    mismatches["uncategorized"] = mismatches.get("uncategorized" , 0) + 1
    #print title_data[0][1],title_type," vs. ",pub_data[0][1],pub_type

print "Bad title records: ",bad_title
print "Bad publication records: ",bad_pub
print "Bad cross-references: ",bad_xref
for key in exceptions:
    title_id = key[1][0]
    pub_id = key[1][1]
    response = title_cursor.execute("select title_ttype,title_title from titles where title_id = %s", (title_id,))
    row1 = title_cursor.fetchall()
    response = pubs_cursor.execute("select pub_ctype,pub_title,pub_tag from pubs where pub_id = %s", (pub_id,))
    row2 = pubs_cursor.fetchall()
    print "|-"
    print "|[http://www.isfdb.org/cgi-bin/title.cgi?"+str(title_id)+" "+str(row1[0][1])+"]"
    print "|"+str(row1[0][0])
    print "|[http://www.isfdb.org/cgi-bin/pl.cgi?"+str(row2[0][2])+" "+str(row2[0][1])+"]"
    print "|"+str(row2[0][0])

for key in mismatches:
    print "Mismatched",key,":",mismatches.get(key , 0)

db.close()
Ahasuerus 23:45, 18 Sep 2007 (CDT)
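As an aside, the per-row comparison the script does in Python could also be pushed into the database as a single join, rather than loading all three tables into memory. Below is a minimal sketch of that idea for the "serials" case, using Python's built-in sqlite3 module so it is self-contained. The table and column names (pub_content, titles, pubs, title_ttype, pub_ctype) follow the script above, but the sample rows are invented for illustration and are not real ISFDB data.

```python
import sqlite3

# Build an in-memory database with the three tables the script reads.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table titles (title_id integer primary key, title_ttype text, title_title text);
    create table pubs (pub_id integer primary key, pub_ctype text, pub_title text);
    create table pub_content (title_id integer, pub_id integer);
""")

# Invented sample rows: one SERIAL correctly inside a MAGAZINE,
# and one SERIAL inside an ANTHOLOGY (a mismatch).
db.executemany("insert into titles values (?, ?, ?)", [
    (1, "SERIAL", "Example Serial Part 1"),
    (2, "SERIAL", "Example Serial Part 2"),
])
db.executemany("insert into pubs values (?, ?, ?)", [
    (10, "MAGAZINE", "Example Magazine"),
    (20, "ANTHOLOGY", "Example Anthology"),
])
db.executemany("insert into pub_content values (?, ?)", [(1, 10), (2, 20)])

# One join finds SERIAL titles whose containing publication is neither
# a MAGAZINE nor a FANZINE -- the "serials" mismatch in the script above.
rows = db.execute("""
    select t.title_id, t.title_title, p.pub_id, p.pub_ctype
    from pub_content c
    join titles t on t.title_id = c.title_id
    join pubs p on p.pub_id = c.pub_id
    where t.title_ttype = 'SERIAL'
      and p.pub_ctype not in ('MAGAZINE', 'FANZINE')
""").fetchall()

for title_id, title, pub_id, pub_type in rows:
    print("SERIAL title %d (%s) in %s publication %d" % (title_id, title, pub_type, pub_id))
```

The same pattern (a where clause per title type) would cover the other mismatch categories, and against the live MySQL database the query itself would be identical.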
Some of the titles appear to be correct. The titles have appeared in serial form in collections and anthologies first, but may not have been published as complete novels. So their listing as a serial may be correct; they just have not been published in the usual magazine form. Kraang 06:12, 20 Sep 2007 (CDT)
The point about sorting by title is well taken; I will do it next when I have a moment. I will also add a Status column, where we could enter "FIXED" as we update the data. Ahasuerus 23:44, 19 Sep 2007 (CDT)
The table has been resorted by title and a "Fixed?" column has been added. However, my attempts to teach myself Python turned out to be more painful than expected (something to do with old dogs and new tricks), so I ended up converting the MySQL database to a more familiar format and parsing it using vintage 1960s technologies. I'll attack other type mismatches shortly. Ahasuerus 23:27, 29 Sep 2007 (CDT)