A question for Google people -- Usenet dump?

Here's a question for data-liberation people in Google. (I know some people who work for Google, but I'm not in contact with whoever can directly answer this.)

For some fifteen years, most of the online discussion about interactive fiction -- probably most of the IF discussion, period -- happened on two Usenet groups: rec.arts.int-fiction and rec.games.int-fiction.

We have archives of those discussions from 1992-1997 and some of 1999-2002. (See IFArchive directories for RAIF and RGIF.) Outside those ranges, we rely on Google and its Groups service -- as you can tell from my two links above.

Google Groups has historically been iffy about Usenet. It started by acquiring the Deja News post archive (which itself only started in 1995, and was not completely preserved). Google's Groups service was then built on top of that -- rather in the sense of a rhinoceros being built on top of an old rollerskate -- and its Usenet access dwindled in priority. Its indexing was famously gappy for many years, although Google fixed that a couple of years ago.

I could get into a long post about Google's treatment of Usenet and its long-term consequences, but that's not this post. My question: we, the IF community, would like to hold our own data here. What's the best way for me to get a complete dump of all messages posted to those two Usenet groups, ever?

Scraping through the Google Groups web interface is a way to do this, but it's not very good, for a few reasons. (a) Google tends to shut down automated trawlers after some number of requests. (b) I'd have to deal with an extra layer of content encoding, which is more room for encoding to go wrong. (c) I don't know if Google's indexing is really complete, even now.

So it would be way better if some nice Google person could tap it at the source and send me a tar file. Or a DVD, or a hard drive, whatever. Anybody?

The qualifications:

Obviously there's no such thing as complete. I'll take whatever Google has, and merge it with the Archive records.
I mean all posts with either rec.arts.int-fiction or rec.games.int-fiction in the Newsgroups: line. I also want all the crossposts, including the off-topic ones, the ones troll-crossposted to a zillion irrelevant groups, all of them. Think Newsgroups:.*rec\.(arts|games)\.int-fiction.*
I think I want spam, too. Probably. It depends on how much spam there is. (Google's index lets through a lot of spam, but maybe there's a thousand times as much which it doesn't show.) Tell me if it's horrible, we'll discuss it.
Original post file format, if possible.
My intent is to take whatever I get, ball it up, and stick it on the Archive. Then (at some point, not necessarily soon) I will go through, cull out the off-topic trolls and spam, and post it as a nice browsable web site on the Archive. Or maybe somebody else will do that part. Collect data first; massage later.
This is a one-shot request, as discussion on those newsgroups has mostly (not entirely) ceased as of a couple of years ago. A dump from beginning-of-time through this month is fine. (The community has shifted to intfiction.org these days. Archiving the web forum is a separate topic, which I also have feelers out on.)
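
To make the Newsgroups: qualification concrete, here is a minimal sketch of the selection test in Python. It assumes each post is available as raw text in its original header-plus-body format; the function name `wanted` is made up for illustration, and nothing here reflects how Google actually stores or exports its Usenet data. Like the pattern above, it is deliberately loose: any post that mentions either group in its Newsgroups: header counts, however many other groups it was crossposted to.

```python
import re
from email import message_from_string

# Same idea as the pattern in the post, applied to the header value only.
GROUP_RE = re.compile(r"rec\.(arts|games)\.int-fiction")

def wanted(raw_post: str) -> bool:
    """Return True if the Newsgroups: header names either IF group.

    Hypothetical helper: assumes raw_post is one article in its
    original format (headers, blank line, body).
    """
    msg = message_from_string(raw_post)
    groups = msg.get("Newsgroups", "")
    return GROUP_RE.search(groups) is not None
```

Checking the parsed header rather than the whole text means a post that only mentions a group name in its body is not picked up by accident.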

If you can help, please comment here, or email me (erkyrath@eblong.com). Thanks.

9 Responses to A question for Google people -- Usenet dump?

Rick Reynolds says:

November 3, 2012 at 7:44 PM

Thanks for taking this up, Zarf. Definitely needed, and it is something I would love to have access to.

Eriorg says:

November 3, 2012 at 8:59 PM

Yes, it's a great idea!

You're also talking about archiving the intfiction.org forum, and I know you archived the IFDB. Do you plan to archive the IFWiki? There's a lot of content there too, and I really wouldn't like it if it was lost.

Rowan says:

November 3, 2012 at 10:12 PM

I bet if Jason Scott and Archive Team don't have their fingers on the pulse of this one, they know who does.

Andrew Plotkin says:

November 3, 2012 at 11:29 PM

IFWiki! Good point. I'll add that to my list.

If Jason is the one who answers this question, great. If it's someone else, also great. :)

Ben Collins-Sussman says:

November 5, 2012 at 8:25 AM

Andrew, my good buddy Brian Fitzpatrick works with me right in the Google Chicago office, and is the founder of Google's Data Liberation Front. I'll have a chat with him about this today. He's quite familiar with IF. :-)

Andrew Plotkin says:

November 5, 2012 at 11:48 AM

Thanks. Let me know.

Andrew Plotkin says:

November 9, 2012 at 12:36 AM

A source has been found. :) More details this weekend.

- Andrew Plotkin says:
  
  November 13, 2012 at 1:27 PM
  
  Someone at Google sent me a post dump, although it's not entirely clear how complete it is. Details: http://www.intfiction.org/forum/viewtopic.php?f=4&t=6259
  
  I am grateful to add it to the collection. Still haven't done the Big Merge, though.
  
usenet search says:

November 13, 2012 at 5:12 AM

I prefer to use a professional Usenet indexer to find files on Usenet; I do not use Google for it.

The Gameshelf