Datasets¶

General¶

1 Billion Word Language Model Benchmark: The purpose of the project is to make available a standard training and test setup for language modeling experiments: [Link]
Common Crawl: The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions: [Link]
Yelp Open Dataset: A subset of Yelp’s businesses, reviews, and user data for use in personal, educational, and academic purposes: [Link]

20 newsgroups The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: [Link]
Broadcast News The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts: [Link]
The wikitext long term dependency language modeling dataset: A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. : [Link]

Question Answering Corpus by Deep Mind and Oxford which is two new corpora of roughly a million news stories with associated queries from the CNN and Daily Mail websites. [Link]
Stanford Question Answering Dataset (SQuAD) consisting of questions posed by crowdworkers on a set of Wikipedia articles: [Link]
Amazon question/answer data contains Question and Answer data from Amazon, totaling around 1.4 million answered questions: [Link]

Multi-Domain Sentiment Dataset TThe Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains): [Link]
Stanford Sentiment Treebank Dataset The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language: [Link]
Large Movie Review Dataset: This is a dataset for binary sentiment classification: [Link]

Aligned Hansards of the 36th Parliament of Canada dataset contains 1.3 million pairs of aligned text chunks: [Link]
Europarl: A Parallel Corpus for Statistical Machine Translation dataset extracted from the proceedings of the European Parliament: [Link]

Legal Case Reports Data Set as a textual corpus of 4000 legal cases for automatic summarization and citation analysis.: [Link]