You are looking at archived content from my "Bookworm" blog, an experiment that ran from 2014-2016. Not all content may work. For current posts, see here.

The Usenet Archive

May 08 2015

Even if you think you don't know Usenet, you probably do. It's the Cambrian explosion of the modern Internet, among the first places that an online culture emerged, but modern enough that it can seamlessly blend into the contemporary web. (I was recently trying to work out through Google where I might buy a clavichord in Boston; my hopes were briefly raised about one particular seller until I realized that the modern-looking Google Groups page I was reading was actually a presentation of a discussion from the Usenet archives in 1992.)

Usenet persists; it's also the prototype of the modern digital archive. One of the best available sources for early usenet is the Internet Archive's UTZOO collection of about 2 million messages from roughly 1981 to 1991. It's too vast to read, frustratingly incomplete, and far more significant in the aggregate than in the details. In other words, it's a perfect candidate for some quantitative textual analysis.

First things first: the easiest way to browse this may be through the standard line chart methods (although the newsgroup search at the bottom of this post is equally interesting).

Some of usenet's most lasting contributions have been abbreviations. Here's the rise of IMHO, for example, starting in around 1987. You can change this to search for any word and click to see examples, although you may prefer to use the bigger interface if you'll be sticking around a while.

A Bookworm browser for the early Usenet by year

    { "database": "usenet",
    "plotType": "linechart","words_collation":"Case_Sensitive",
    "search_limits": {"word": ["IMHO"],"date_year":{"$gte":1982,"$lte":1991}},
    "aesthetic": {  "y": "WordsPerMillion",  "x": "date_year"  }
    }
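
The query above is plain JSON, so swapping in another word is mechanical. Here is a minimal sketch in Python that assumes only the keys shown above; the helper name `word_by_year_query` is hypothetical and not part of Bookworm.

```python
import json


def word_by_year_query(word, first_year=1982, last_year=1991):
    """Build a line-chart query shaped like the IMHO example for another word.

    Hypothetical helper: the keys simply mirror the query shown in this post.
    """
    return {
        "database": "usenet",
        "plotType": "linechart",
        "words_collation": "Case_Sensitive",
        "search_limits": {
            "word": [word],
            "date_year": {"$gte": first_year, "$lte": last_year},
        },
        "aesthetic": {"y": "WordsPerMillion", "x": "date_year"},
    }


query = word_by_year_query("FAQ")
print(json.dumps(query, indent=4))
```

Because the collation is case sensitive, "FAQ" and "faq" would be separate searches.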

I'm not an expert on usenet. A history of usenet is here; Ian Milligan has written up his explorations of the Canadian Usenet. And I'm not planning to do much more with this collection in the near future.

The major purpose of this is to start testing out some portable code for building a Bookworm on any set of e-mails. Loading this into Bookworm was pretty easy: I simply took an hour or so and hacked together a Makefile to download the UTZOO files and run an ugly parser to pull out the metadata. (Usenet posts are close enough to e-mails that Python's email.parser module happily extracts the metadata. Character encoding is a major problem, so I've simply dropped everything that fails to parse as Unicode/ASCII.) Once parsed, two lines of code write out the files in bookworm format, and the bookworm ingest handles all the messy details of tokenization, charting, and the rest. I later ran a second pass to better deal with some of the e-mail address formats; that will be useful for the next e-mail collections that I run it on. (Probably some particularly interesting e-mail list archives: the Humanist, R-help, perhaps the entirety of H-net if I can download it all.)
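
The parsing step can be sketched roughly as follows. This is not the actual parser used for this Bookworm, just a minimal illustration of the approach described: decode each post strictly, drop anything that fails, and let the standard library's email parser pull out the headers. The function name `parse_post`, the choice of fields, and the sample message are all assumptions.

```python
from email.parser import Parser
from email.utils import parseaddr


def parse_post(raw: bytes):
    """Return a metadata dict for one post, or None if it is not clean UTF-8/ASCII."""
    try:
        text = raw.decode("utf-8")  # ASCII is a subset of UTF-8
    except UnicodeDecodeError:
        return None  # drop posts with undecodable bytes
    msg = Parser().parsestr(text)
    # parseaddr splits "address (Name)" or "Name <address>" into (name, address)
    _, address = parseaddr(msg.get("From", ""))
    return {
        "from": address,
        "newsgroups": msg.get("Newsgroups", ""),
        "date": msg.get("Date", ""),
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),
    }


# Invented sample post, for illustration only.
sample = (
    b"From: jdoe@example.edu (J. Doe)\n"
    b"Newsgroups: net.politics\n"
    b"Subject: IMHO\n"
    b"Date: Mon, 4 May 1987 10:00:00 GMT\n"
    b"\n"
    b"IMHO this is fine.\n"
)
post = parse_post(sample)
```

Dropping undecodable posts is crude, but it keeps the pipeline simple at the cost of losing some messages.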

The Usenet population

One obvious set of questions is where users are posting from. The top-level domains of posts give some idea of the ways that posters reached usenet. Some are familiar (.uk, .com, .edu), but the first Domain Name System domains weren't registered until 1985. Dominating the early years are UUCP (usenet's own transfer protocol) and .arpa, going back to the ARPAnet itself.
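
As a rough illustration of how a poster's address might be reduced to these categories, here is a minimal Python sketch. It is not the code behind the charts: the function name `split_domain` is hypothetical, and the rule that an address with a `!` and no `@` counts as a UUCP bang path is an assumption. The same split also yields the mid-level domain (the last two labels) used in the later charts.

```python
from email.utils import parseaddr


def split_domain(from_header: str):
    """Return (tld, mld) for a From header; UUCP bang paths are labelled 'uucp'."""
    _, address = parseaddr(from_header)
    if "@" not in address:
        # Assumption: an address with '!' and no '@' is a UUCP bang path.
        return ("uucp", None) if "!" in address else (None, None)
    host = address.rsplit("@", 1)[1].lower().rstrip(".")
    labels = host.split(".")
    tld = labels[-1]
    # Mid-level domain: the last two labels, e.g. "mit.edu".
    mld = ".".join(labels[-2:]) if len(labels) >= 2 else None
    return tld, mld
```

Taking the last two labels is naive for country domains such as .uk, where the institution usually sits one level further down.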

Changing top-level domains

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "tld__id": {
            "$lte": 10
        },
        "date_year": {
            "$gte": 1983,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TotalTexts",
        "fill": "tld"
    }
}

It's a similar story for the mid-level domains: some easily recognizable present-day institutions, and some almost forgotten companies. MIT, Berkeley, and Hewlett-Packard produced the most usenet posters in this set. bio.net is in the top ten probably only because this is a biology-oriented collection. The Digital Equipment Corporation is one of the big losers; I don't know what CTS.com is, although it should be easy to look up.

Twenty largest mid-level domains

{
    "database": "usenet",
    "plotType": "barchart",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "mld__id": {
            "$lte": 20
        }
    },
    "aesthetic": {
        "x": "TotalTexts",
        "y": "mld"
    }
}

After the DNS takes over, the general story is of expansion. Both for .com and .edu e-mail addresses, the most common domains (by percentage) fade away.

The top .com domains, as percentage of all .com posts.

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "tld":"com",
        "*mld__id": {
            "$lte": 30
        },
        "date_year": {
            "$gte": 1986,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TextPercent",
        "fill": "*mld"
    }
}

Relative uses of the top .edu domains, as percentage of all .edu posts.

{
    "database": "usenet",
    "plotType": "streamgraph",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "tld":"edu",
        "*mld__id": {
            "$lte": 30
        },
        "date_year": {
            "$gte": 1986,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_year",
        "y": "TextPercent",
        "fill": "*mld"
    }
}

Newsgroups

The real heart of usenet is the newsgroups. So what are they?

The most popular ones, it seems, are all computer-oriented: for Apple, for Macintosh, for X-Windows, and for Atari. (I'm surprised how long the Atari and Amiga usergroups stay fairly active.)

Top newsgroups by messages, as percentage of all messages

{
    "database": "usenet",
    "plotType": "barchart",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {"*Newsgroups__id":{"$lte":100},
        "date_year": {
            "$gte": 1981,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "TextPercent",
        "y": "*Newsgroups"
    }
}

One particularly useful feature is being able to search for words by newsgroup. Who uses "paranoid" the most? net.politics, naturally; but the Mac and Amiga groups have a surprisingly large number of uses, too. If you search for "crazy", you'll see that at least for Amiga this holds up.

Uses of paranoid (or a word of your choosing) in the 50 largest newsgroups

{
    "database": "usenet",
    "plotType": "barchart",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "word":["paranoid"],
        "*Newsgroups__id":{"$lte":50},
        "date_year": {
            "$gte": 1981,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "TextPercent",
        "y": "*Newsgroups"
    }
}

The next logical step is to juxtapose some of these against each other. So for example, we can look to see which newsgroups used the phrase IMHO the earliest.

Which newsgroups use IMHO the most and the earliest? Colors show percentage of emails in cell using the phrase; click to read.

{
    "database": "usenet",
    "plotType": "heatmap",
    "method": "return_json",
    "words_collation": "Case_Sensitive",
    "search_limits": {
        "word":["IMHO"],
        "Newsgroups__id":{"$lte":50},
        "date_month":{"$gte":3000},
        "date_year": {
            "$gte": 1981,
            "$lte": 1991
        }
    },
    "aesthetic": {
        "x": "date_month",
        "color": "TextPercent",
        "y": "Newsgroups"
    }
}

But this shows a massive data integrity issue! All of the net newsgroups are from 1983 to 1987; all of the comp newsgroups are from after about Christmas 1987.

Why is this? Maybe I've done something wrong; or maybe it has to do with the way that the actual tapes were collected.

In any case, it's just about enough to get me to give up.

If I were to spend some more time on this, I think the most interesting places would be net.politics, the slowly dying communities around Apple and Amiga products, and some looks at gender based on e-mail addresses (particularly in the gender section). The rise of usenet comes at just the same time as women begin to vanish from computer science; even if it's not causal, there should be some evidence to look at.

One last question: where's the eternal September?

But let me finish with a strange little question. One of the reasons I set up this blog was to share some of the browsers for data that I'm not sure what to do with. Usenet is one of those: I occasionally teach the computer culture of the 70s and 80s in history classes, but not to the degree that I feel particularly confident making broad statements about what's out there. For instance: one of the things that we know about Usenet is that AOL ruined it in 1993 with the onset of the eternal September. The presumption there is that the character of Usenet changed each September with the influx of college students.

But in fact, I have a hard time finding obvious terms that spike in September. (You'll see a discontinuity in June: that's because 1991 is the heaviest-traffic year for the first half, but has no text for after the end of the spring semester.)

This means that FAQ, the most obvious search for this field, is almost unusable.

Where is the eternal September? Percentage of e-mails using netiquette by day of year, aggregated over the period 1982-1991

    { "database": "usenet",
    "plotType": "linechart",
    "search_limits": {"word": ["netiquette"],"date_year":{"$gte":1982,"$lte":1992},"date_day_year":{"$lte":366}},
    "aesthetic": {  "y": "TextPercent",  "x": "date_day_year"  }
    }
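
For readers wondering what a field like `date_day_year` amounts to, here is a minimal Python sketch of deriving a day-of-year value from a post's Date header. It only illustrates the idea and is not how Bookworm computes the field; the function name `day_of_year` is hypothetical, and real 1980s Date headers are often messier than this parser accepts.

```python
from email.utils import parsedate_to_datetime


def day_of_year(date_header: str):
    """Return the 1-366 day of year for a Date header, or None if unparseable."""
    try:
        return parsedate_to_datetime(date_header).timetuple().tm_yday
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


print(day_of_year("Thu, 1 Sep 1988 08:00:00 GMT"))  # 245 (1988 is a leap year)
```

In non-leap years September 1 is day 244, so a September effect would show up as a rise starting around days 244-245.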