Queer Yahoo Groups Preservation Project
Note: A collection of redacted groups available for request is now up. More groups will be added as redaction is completed.
This project (with the assistance of community members) collects, archives, and preserves messages and select files posted to LGBTQ-related Yahoo Groups from their founding until late 2019. It emerged in response to Yahoo's October 16, 2019 announcement that the service would be shuttered and all archives deleted on December 14.
Interested group administrators may contact the Queer Digital History Project (QDHP) for assistance in archiving their group history. These administrators (and their groups) may also choose to donate a redacted copy of the group messages, alongside submitted files and images (excluding photographs), to be held by the QDHP (see a copy of the donation form here). A further discussion of redaction is included in the "Ethical Considerations" section below.
Group archives are collected and held in two different ways, depending on their level of current activity and original access restrictions:
- Public, Non-Active (no substantive posts within the last six months) Groups: Messages to these groups may be collected, and a redacted copy of the group messages, without accompanying files or images, may be included in the list of Yahoo Groups Archives available for request.
- Public or Private Active Groups: These archives have been donated to the QDHP and are held with specific access restrictions.
Smaller, non-active groups (under 100 members) are not currently scheduled to be collected unless their administrator requests it.
The QDHP's archival copy will be held on a secure, password-protected hard drive, alongside the donation form (filled out at the time of donation). Individuals interested in accessing these archives may contact the project directly to request a copy of one or more Yahoo Groups archives. Once a request is submitted, either the curatorial team or (should the donor have chosen this option at the time of donation) the donor/donating group will decide whether to grant it.
Messages are scraped from Yahoo Groups using a script developed by Kevin McCarthy. The initial output is JSON, which a Python 3 script then converts into an e-mail-readable format. Each thread is stored as its own .txt file, named with its original Thread ID number and subject. This storage method, while less technical, aims to maximize human readability and file interoperability into the future.
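The conversion step might look roughly like the sketch below. It is not the project's parser: the field names (thread_id, subject, from, date, body), the write_threads and safe_filename helpers, and the file-name pattern are all assumptions made for illustration, and the real scraper output may be shaped differently.

```python
import re
from collections import defaultdict
from pathlib import Path


def safe_filename(text, max_len=80):
    """Replace characters that are unsafe in file names and trim the length."""
    cleaned = re.sub(r"[^\w\- ]+", "_", text).strip(" _")
    return cleaned[:max_len] or "no_subject"


def write_threads(messages, out_dir):
    """Group messages by thread ID and write one .txt file per thread.

    `messages` is a list of dicts with the hypothetical keys
    thread_id, subject, from, date (ISO 8601 string) and body.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    threads = defaultdict(list)
    for msg in messages:
        threads[msg["thread_id"]].append(msg)

    written = []
    for thread_id in sorted(threads):
        msgs = threads[thread_id]
        # ISO 8601 date strings sort chronologically as plain text.
        msgs.sort(key=lambda m: m["date"])
        subject = msgs[0]["subject"]
        path = out_dir / f"{thread_id} - {safe_filename(subject)}.txt"
        blocks = [
            f"From: {m['from']}\nDate: {m['date']}\n"
            f"Subject: {m['subject']}\n\n{m['body']}\n"
            for m in msgs
        ]
        path.write_text("\n".join(blocks), encoding="utf-8")
        written.append(path)
    return written
```

Under these assumptions, the scraper's output would be loaded with json.load and passed to write_threads together with an output directory. Plain .txt files with readable headers keep each thread openable in any text editor, which is the interoperability goal described above.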
The conversion script also offers the option to redact messages. Redaction happens in three ways:
- Headers: Using the built-in email module, headers containing possible identifying information are deleted.
- E-mail addresses: E-mail addresses in both the body and headers are replaced with "<[EMAIL REDACTED]>" using an e-mail matching regular expression.
- First and Last Names (when used together) [To be released in updated parser]: Using the Stanford Named Entity Recognizer and nameparser, first and last names are identified in the body text, and last names are replaced with "[REDACTED]". (You can also test out the NER model online.)
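The first two redaction steps can be sketched with Python's standard library alone. This is an illustration under stated assumptions, not the project's code: the list of headers treated as identifying, the e-mail pattern, and the redact_message helper are hypothetical, and the sketch handles only single-part plain-text messages. The name-redaction step is left out because it depends on external tools.

```python
import re
from email import message_from_string

# Assumption: which headers count as identifying is a guess for illustration.
IDENTIFYING_HEADERS = [
    "Received", "Return-Path", "X-Originating-IP", "X-Sender", "Message-ID",
]

# A deliberately simple e-mail pattern; it also swallows surrounding <>.
EMAIL_RE = re.compile(r"<?[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}>?")
PLACEHOLDER = "<[EMAIL REDACTED]>"


def redact_message(raw):
    """Drop identifying headers and mask e-mail addresses in one message."""
    msg = message_from_string(raw)

    # Step 1: delete identifying headers (no error if a header is absent).
    for name in IDENTIFYING_HEADERS:
        del msg[name]

    # Step 2a: mask addresses in the remaining headers, preserving order.
    headers = msg.items()
    for name, _ in headers:
        del msg[name]
    for name, value in headers:
        msg[name] = EMAIL_RE.sub(PLACEHOLDER, value)

    # Step 2b: mask addresses in the body (single-part text assumed).
    body = msg.get_payload()
    msg.set_payload(EMAIL_RE.sub(PLACEHOLDER, body))

    return msg.as_string()
```

For example, a header such as "From: Alice Smith <alice@example.org>" would come out as "From: Alice Smith <[EMAIL REDACTED]>". The display name is left in place at this stage, which is the gap the planned name-redaction step is meant to address.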
Ethical Considerations
The core ethical approach of the QDHP has been to preserve and collect files and documentation related to queer life online, while consistently prioritizing user agency and original expectations of privacy over preservation for preservation's sake. For this reason, donors (and later group members) may select the files and images (such as logos, event posters, etc.) they'd like included in the archive. Given both the difficulty of gaining subject consent and the sensitive nature of some groups' topical focus, the archives do not contain photographs. This choice represents a necessary loss out of respect for posters' likely original intent that these photos were for group members only.
The redaction process emphasizes posters' right to not have their original messages included in the archive, if they choose. At the time of donation, donors may include a list of e-mail addresses whose messages should be excluded from the archive, and members of a group may at any point after initial donation contact the curatorial team (firstname.lastname@example.org) to request that their e-mail address(es) be added to the list.
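An exclusion list of this kind could be applied as a simple filter before redaction, while sender addresses are still present. The sketch below is only one possible approach: the filter_excluded helper and the from key are hypothetical names, not part of the project's parser.

```python
from email.utils import parseaddr


def filter_excluded(messages, excluded_addresses):
    """Return only the messages whose sender is not on the exclusion list.

    `messages` is a list of dicts with a hypothetical "from" key holding
    the raw From header, e.g. "Alice <alice@example.org>".
    """
    # Compare case-insensitively, since address case is rarely significant.
    excluded = {addr.strip().lower() for addr in excluded_addresses}
    kept = []
    for msg in messages:
        _, address = parseaddr(msg["from"])
        if address.lower() not in excluded:
            kept.append(msg)
    return kept
```

Because the filter only looks at the sender, text from an excluded poster can still survive inside other members' quoted replies.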
However, redaction is not anonymization, and it does not aim to be. Message text may be preserved in other posters' quotes of up-thread messages, especially depending on their mail client. Even so, redacting all e-mail addresses and last names aims to limit excluded posters' visibility within the archive. Ideally, redaction also limits the applicability of large-scale data analysis (such as name and e-mail scraping) to the archives.
Lastly, some public groups became the target of automated spam bombing within the last few years. These postings often included imagery and terms that were at odds with posters' own self-image. When these posts did not prompt wider discussion among members, they may not be included in the archive.