Curated and augmented data for 44,953 legislative speeches from the National Constituent Assembly during the French Revolution. Companion to the PNAS article "Individuals, institutions, and innovation in the debates of the French Revolution."
This corpus was created from the French Revolution Digital Archive (FRDA), a digitization of the Archives Parlementaires (AP) made available through the efforts of Stanford University Libraries and the Bibliothèque nationale de France. This data contains the FRDA's OCR-generated text from a subset of speeches made during the National Constituent Assembly, the first legislative body of the French Revolution. Each speech is augmented with date correction, speaker disambiguation, legislative role markers, political affiliation, and class membership. See the column guide for more detail. Also provided is the topic model trained from these speeches and used in the PNAS article.
See FRevNCA_CuratedData.ipynb for details on each file below.
FRevNCA_speechdata.txt.gz
: contains raw and processed speech text, speaker information, and metadata. UTF-8 encoded, with '=+=' column delimiters and newline row delimiters, gzipped with level 9 compression.
FRevNCA_ProcessedVocabText_topics.gz
: topics trained from speech data via Latent Dirichlet Allocation.
FRevNCA_ProcessedVocabText_topicmixtures.gz
: topic mixtures associated with the topics above.
FRevNCA_ProcessedVocabText_vocabbasis.txt.gz
: vocabulary basis for the topics.
FRevNCA_CuratedData.ipynb
: loads and describes data.
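The delimiter and compression described for FRevNCA_speechdata.txt.gz can be read with the standard library alone. The following is a minimal sketch, not the notebook's own loading code: the function name `read_speech_rows` is hypothetical, and it assumes the first row of the file holds the column names (pass `has_header=False` if it does not). FRevNCA_CuratedData.ipynb remains the reference for how the files are actually loaded.

```python
import gzip

DELIMITER = "=+="  # column delimiter stated in the file description


def read_speech_rows(path, has_header=True):
    """Yield one record per row of a gzipped, '=+='-delimited text file.

    Assumption: the first row holds the column names. With
    has_header=False, each row is yielded as a plain list of fields.
    """
    # newline="\n" keeps "\n" as the only row delimiter, as described above.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        header = None
        for line in handle:
            fields = line.rstrip("\n").split(DELIMITER)
            if has_header and header is None:
                header = fields  # remember column names, emit nothing
                continue
            yield dict(zip(header, fields)) if has_header else fields
```

If pandas is available, `pandas.DataFrame(list(read_speech_rows(path)))` should turn the records into a table; all values arrive as strings, so columns such as dates or page numbers would still need converting.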
These data are a curated and augmented subset of data originally obtained from the publicly available XML files posted on Stanford's FRDA website, retrieved for this work circa 2015. The FRDA web interface has changed since then, but metadata relevant to the original XML is retained for completeness.
NCASpeechId
: universal speech index used for all data.
Date
: date of the speech. These were cleaned and corrected from the original, which had errors in order and in formatting.
OrigFile
: original XML file.
Volume
: original volume of the AP.
PbTagId
: location id used throughout the original XML, useful for the old FRDA web interface or for working with the original XML files. The speech falls after this PbTagId and before the next, in AP page order.
PageNum
: page of the AP on which the speech occurs.
SpeakerStr
: speaker string provided by the FRDA XML.
Surname and Name
: identities disambiguated from all the SpeakerStrs. These are the ones used in the PNAS analysis. Note: although a lot of manual attention went into these attributions, they are not guaranteed to be 100% accurate! There was significant noise in the SpeakerStr data; see the Supplementary Material, "Preparing and characterizing speech data" section, for more detail. "nomatch" indicates the speech's SpeakerStr was not assigned to a disambiguated entity.
Affiliation
: "g" (gauche), "d" (droite), "nonpos" (matched identity isn't positively identified as gauche or droite according to our historian co-author), or "nomatch" (no identity match was made to SpeakerStr).
Estate
: 1st/2nd/3rd estate, or "nonpos"/"nomatch" as for Affiliation.
Club
: an assortment of political clubs to which individuals belonged, or "nonpos"/"nomatch".
President
: binary presidential speech indicator.
CommitteeStatus
: "newitem" (speaker as committee proxy introduces a decree proposal to the floor), "indebate" (committee proxy speaks in the midst of debate), or "noncomm" (speaker is not a committee proxy).
RawTextFr
: the raw speech text obtained from the original XML.
RawTextEnTrans
: For giggles, I made a script circa 2016 that queries Google Translate with all of the raw speeches. Results are included here.
ProcessedText
: RawTextFr after light tokenization.
ProcessedVocabText
: ProcessedText after removing words with fewer than 3 characters and stop words, then limiting to a 10,000-word vocabulary by highest observed frequency.
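The steps described for ProcessedVocabText (drop short words, drop stop words, keep the most frequent vocabulary) can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the column: the function name `restrict_vocabulary` is hypothetical, the stop-word list is supplied by the caller, "frequency" is taken to mean total token count across all speeches, and ties at the cutoff are broken arbitrarily.

```python
from collections import Counter


def restrict_vocabulary(docs, stop_words, min_chars=3, vocab_size=10_000):
    """Filter tokenized documents down to a frequency-limited vocabulary.

    docs: list of token lists, one per speech (assumed already tokenized).
    stop_words: set of tokens to drop (assumed; the actual list is not given here).
    Returns (filtered_docs, vocab).
    """
    # Steps 1 and 2: remove words with fewer than min_chars characters, then stop words.
    kept = [
        [tok for tok in doc if len(tok) >= min_chars and tok not in stop_words]
        for doc in docs
    ]
    # Step 3: keep only the vocab_size most frequent remaining tokens.
    counts = Counter(tok for doc in kept for tok in doc)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}
    return [[tok for tok in doc if tok in vocab] for doc in kept], vocab


# Toy example with a 2-word vocabulary instead of 10,000.
docs = [
    ["les", "droits", "de", "l", "homme"],
    ["les", "droits", "du", "peuple"],
    ["le", "peuple", "et", "les", "droits"],
]
filtered, vocab = restrict_vocabulary(docs, stop_words={"les"}, vocab_size=2)
```

In the toy example, "homme" survives the length and stop-word filters but falls outside the two most frequent words, so it is dropped in the last step.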
- python (3.7.10)
- numpy (1.18.5)
- pandas (1.3.5)