ISFDB Author Communities
There has been recent interest and research in the area of
how the Internet (or at least the Web part of it)
organizes itself into communities. The research centers on
algorithms that study linkage patterns
between web pages. Simplistically, people tend to link to
pages that interest them, and pages that link to
each other presumably share similar interests. By measuring
the number of links from a page (its out-degree) and the number
of links to a page (its in-degree), these linkage patterns can be used to
discover online communities.
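As a rough illustration of the idea (not any particular published
algorithm), here is a tiny Python sketch that counts in- and
out-degrees from a list of hyperlinks; the page names are made up:

    # Toy illustration: count in-degree and out-degree for each page
    # from a list of (source, target) hyperlinks.  Page names are invented.
    from collections import Counter

    links = [
        ("fan_site.html", "author_a.html"),
        ("fan_site.html", "author_b.html"),
        ("author_a.html", "author_b.html"),
        ("author_b.html", "author_a.html"),
    ]

    out_degree = Counter(src for src, dst in links)
    in_degree = Counter(dst for src, dst in links)

    for page in sorted(set(out_degree) | set(in_degree)):
        print(page, "out:", out_degree[page], "in:", in_degree[page])

Pages that share many links end up with heavily overlapping
neighborhoods, which is the sort of pattern the community-finding
algorithms look for.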
The ISFDB itself is a microcosm of the World Wide Web. It consists
of several hundred thousand web pages, most of which link
to other pages in the ISFDB. From the World Wide Web's point of
view, the ISFDB certainly forms an artificial community. A more
interesting question, however, is: do the web pages within the ISFDB
tend to form sub-communities?
From a genre perspective, we all know that some stories are
"Analog stories" and that others are "F&SF stories". The editor
of a particular magazine tends to look for a particular type
of story, and over time authors begin to write exactly
the kinds of stories that maximize their probability of
publication. As with web pages of similar interests, the genre
fiction of various authors also has a tendency to clump
together into self-organizing communities.
What follows, then, is a series of connected graphs that show
author relationships for specific SF eras. The algorithm used
to generate the graphs is discussed in detail below. In general,
magazines and anthologies are searched for instances when
author A was published alongside author B. The more times author
A was published alongside author B, the greater the likelihood
that they are part of a community. Since communities rise and
fall over the course of time ("The Gernsback Era" vs. "The Campbell
Era"), the data is examined in 12-year chunks (more about that
later as well).
So here are 8 graphs, one for each 12-year period. Each graph
includes the authors who were the most connected with other
authors, shows the linkage between those authors, and is laid out
so that communities of authors with high linkage are physically
located near each other. (Note: these graphs are very large, and
you'll need to magnify the image to make out the author names.) I'll
add some commentary for each graph at a later date.
- 1914-1925 (T=80%)
- 1926-1937 (T=90%)
- 1938-1949 (T=90%)
- 1950-1961 (T=85%)
- 1962-1973 (T=90%)
- 1974-1985 (T=88%)
- 1986-1997 (T=70%)
- 1998-Present (T=84%)
The Community Algorithm
- First, only those publications that fall into the requested
year span are loaded into the application. This restricts the
graph to the specified time span.
- We want to track the number of times Author X appears
alongside Author Y. This is done by constructing a matrix and
marking the intersections of X and Y. Since
there are almost 24,000 authors in the ISFDB, a full matrix would
have over half a billion cells. In practice, however, a particular
author never appears with most of the other authors, so the
matrix is very sparse. (A rough sketch of this and the following
steps appears after this list.)
- Each publication is examined and all author intersections
are recorded in the matrix. In the graphs that appear online,
book reviews, poetry, monthly columns, and interviews are
ignored.
- Once all of the data has been examined and the matrix has
been populated, each row and column of the matrix is summed.
Since each cell holds the number of times Author X has
been published alongside Author Y, summing a row (or column)
gives the total number of connections Author X has with
all other authors.
- The list of sums is sorted. Authors at the top of the list
have the most connections with other authors; those at the
bottom of the list have the fewest.
- A threshold value is constructed as a percentage of the summed
value for the most connected author in the matrix. Authors with
summed values lower than the threshold are thrown away. This is
done to keep the graph legible. The threshold has to be tuned
to the time period specified: the goal is to have about 600
connections on the graph, which translates to a threshold of
between 75 and 90 percent. The exact threshold is unique to each graph.
- Those authors who exceed the threshold are placed on a working
list. The list is ordered such that the authors with the highest
summed values are worked on first, i.e. the authors with the
most connections to other authors.
- For each author on the list, the matrix is searched to find
those authors with the highest number of connections to the target
author. These represent the authors with whom Author X appeared
most often. In Web research, connectedness is measured via a
combination of links pointing to a page (in-degree) and the
links a page points to (out-degree). In our case, links are
bidirectional, so they are simply referred to as spokes.
For the graphs displayed here, the number of spokes explored
for a particular author is limited to 20. When
a spoke is accepted, the connection information is output in
dot format.
- When the edge between Author X and Author Y is visited,
that cell in the matrix is marked so that it won't be output again
when the other author is worked on.
- The final output is fed to the dot application, which generates
a graph in PostScript. Dot tends to see things as a hierarchy, so
the authors at the top of the graph are those with the
most connections to other authors. Dot takes care of the layout
of the graph. The PostScript is then run through distill to create
the PDF version.
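To make the steps above concrete, here is a rough Python sketch of
the pipeline. This is not the actual ISFDB code: the publication
records are invented, the 80% threshold is just an example value,
and in the real system the data comes from the ISFDB database, with
the content-type filtering and per-era threshold tuning described
above.

    # Rough sketch of the community algorithm.  The publication data
    # here is invented; in the real system it comes from the ISFDB
    # database, restricted to the requested 12-year span, with reviews,
    # poetry, columns, and interviews filtered out.
    from collections import defaultdict
    from itertools import combinations

    publications = [
        ["Author A", "Author B", "Author C"],   # contents of one magazine issue
        ["Author A", "Author B"],               # contents of an anthology
        ["Author B", "Author C", "Author D"],
    ]

    # Sparse co-occurrence "matrix": counts[x][y] = number of times
    # x and y appeared in the same publication.
    counts = defaultdict(lambda: defaultdict(int))
    for contents in publications:
        for x, y in combinations(sorted(set(contents)), 2):
            counts[x][y] += 1
            counts[y][x] += 1

    # Sum each row: the total number of connections an author has.
    totals = {author: sum(row.values()) for author, row in counts.items()}

    # Threshold as a percentage of the most connected author's total.
    T = 0.80                                    # tuned per era (75-90%)
    threshold = T * max(totals.values())
    working = sorted((a for a in totals if totals[a] >= threshold),
                     key=lambda a: totals[a], reverse=True)

    MAX_SPOKES = 20
    emitted = set()                             # edges already output

    print("graph authors {")
    for author in working:
        # Strongest spokes first, at most MAX_SPOKES of them.
        spokes = sorted(counts[author], key=lambda b: counts[author][b],
                        reverse=True)[:MAX_SPOKES]
        for other in spokes:
            edge = frozenset((author, other))
            if edge in emitted:                 # already output from the other end
                continue
            emitted.add(edge)
            print('  "%s" -- "%s";' % (author, other))
    print("}")

The output is an undirected graph in dot syntax (lines like
"Author B" -- "Author A";). In the real pipeline that text is handed
to dot, which lays out the graph and emits the PostScript that is
then distilled to PDF as described above.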
Copyright (c) 1995-2005 Al von Ruff