Datasets
General
- 1 Billion Word Language Model Benchmark: The purpose of the project is to make available a standard training and test setup for language modeling experiments: [Link]
- Common Crawl: The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions: [Link]
- Yelp Open Dataset: A subset of Yelp’s businesses, reviews, and user data for personal, educational, and academic use: [Link]
Text Classification
- 20 Newsgroups: A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: [Link]
- Broadcast News: The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks, with corresponding transcripts: [Link]
- WikiText Long Term Dependency Language Modeling Dataset: A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia: [Link]
Question Answering
- Question Answering Corpus: Two corpora from DeepMind and Oxford of roughly a million news stories with associated queries, drawn from the CNN and Daily Mail websites: [Link]
- Stanford Question Answering Dataset (SQuAD): Questions posed by crowdworkers on a set of Wikipedia articles: [Link]
- Amazon question/answer data: Question and answer data from Amazon, totaling around 1.4 million answered questions: [Link]
Sentiment Analysis
- Multi-Domain Sentiment Dataset: Product reviews taken from Amazon.com from many product types (domains): [Link]
- Stanford Sentiment Treebank: The first corpus with fully labeled parse trees, which allows a complete analysis of the compositional effects of sentiment in language: [Link]
- Large Movie Review Dataset: A dataset for binary sentiment classification: [Link]