Movie Reviews Dataset (Pang and Lee 2005)
Raw text from movie reviews by four critics comes from the scaledata v1.0 dataset released by Pang and Lee (http://www.cs.cornell.edu/people/pabo/movie-review-data/).
Preprocessing
Given plain-text files of movie reviews, we tokenized and then stemmed using the Snowball stemmer from the nltk Python package, so that words with similar roots (e.g. film, films, filming) all map to the same token. We removed all tokens in Mallet's list of common English stop words, as well as any token among the 1000 most common first names from the US census. We added this step after seeing common first names like Michael and Jennifer appear meaninglessly in many top-word lists for trained topics. We manually whitelisted "oscar" and "tony" due to their salience for movie-review sentiment. We then counted all remaining tokens across the full raw corpus of 5006 documents, discarding any token that appeared in more than 20% of all documents or in fewer than 30 distinct documents. The final vocabulary list has 5375 terms.
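The pipeline above can be sketched roughly as follows. The stop-word and first-name sets here are tiny stand-ins for Mallet's stop list and the census names list, and the regex tokenizer is an assumption (the original tokenizer is not specified); only the Snowball stemmer matches the tool actually named.

```python
import re
from collections import Counter

from nltk.stem.snowball import SnowballStemmer

# Tiny stand-ins for Mallet's stop list and the US census first names.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "of", "for", "this"}
FIRST_NAMES = {"michael", "jennifer"}
WHITELIST = {"oscar", "tony"}  # kept despite being common first names

_stemmer = SnowballStemmer("english")

def tokenize(text):
    """Lowercase, tokenize, drop stop words / first names, then stem."""
    toks = []
    for raw in re.findall(r"[a-z']+", text.lower()):
        if raw not in WHITELIST and (raw in STOPWORDS or raw in FIRST_NAMES):
            continue
        toks.append(_stemmer.stem(raw))
    return toks

def build_vocab(tokenized_docs, min_df=30, max_frac=0.20):
    """Keep tokens appearing in at least min_df distinct documents
    and in at most max_frac of all documents."""
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks))
    n = len(tokenized_docs)
    return sorted(t for t, c in df.items() if min_df <= c <= max_frac * n)
```

Filtering on the raw (unstemmed) token keeps the whitelist check simple, since stemming can alter names (e.g. "tony" stems to "toni").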
Each of the 5006 original documents was then reduced to this vocabulary set. We discarded any documents that were too short (fewer than 20 tokens), leaving 5005 documents. Each document has a binary label, where 0 indicates a negative review (below 0.6 on the original dataset's 0-1 scale) and 1 indicates a positive review (>= 0.6). This 0.6 threshold matches a threshold previously used in the raw data's 4-category scale to separate 0- and 1-star reviews from 2- and 3-star (of 3) reviews. Data pairs ( $x_d, y_d$ ) were then split into training, validation, and test sets. The validation and test sets each used 10% of all documents, evenly balancing positive and negative labels. The remaining documents were allocated to the training set.
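A minimal sketch of the label thresholding and the balanced held-out split described above, assuming ratings on the original 0-1 scale; the function names, the seed, and the use of Python's `random` module are all hypothetical, not taken from the original pipeline.

```python
import random

def binarize(rating):
    """Map a 0-1 scale rating to a binary sentiment label (threshold 0.6)."""
    return 1 if rating >= 0.6 else 0

def balanced_split(labels, frac=0.10, seed=0):
    """Hold out `frac` of documents each for validation and test,
    with equal numbers of positive and negative labels; the rest
    go to training. Returns three lists of document indices."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    half = int(frac * len(labels)) // 2  # per-class count for each held-out set
    valid = pos[:half] + neg[:half]
    test = pos[half:2 * half] + neg[half:2 * half]
    held = set(valid) | set(test)
    train = [i for i in range(len(labels)) if i not in held]
    return train, valid, test
```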
Dataset Specs
Specs were computed separately for the TRAIN, VALID, and TEST sets of movie_reviews.
Dataset Loaders
- huggingface/datasets
- facebookresearch/ParlAI
- allenai/allennlp-models
The benchmarks section lists all benchmarks using a given dataset or any of its variants. We use variants to distinguish between results evaluated on slightly different versions of the same dataset. For example, ImageNet 32⨉32 and ImageNet 64⨉64 are variants of the ImageNet dataset.
SST (Stanford Sentiment Treebank)
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.
Each phrase is labelled as either negative, somewhat negative, neutral, somewhat positive, or positive. The corpus with all 5 labels is referred to as SST-5 or SST fine-grained. Binary classification experiments on full sentences (negative or somewhat negative vs. somewhat positive or positive, with neutral sentences discarded) refer to the dataset as SST-2 or SST binary.
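The SST-5 to SST-2 reduction can be sketched as follows, assuming the common 0-4 integer encoding of the five fine-grained labels (an assumption; the function name is hypothetical):

```python
def sst2_label(sst5_label):
    """Collapse an SST-5 label (0=negative, 1=somewhat negative,
    2=neutral, 3=somewhat positive, 4=positive) into SST-2's binary
    scheme; neutral sentences are discarded (returned as None)."""
    if sst5_label == 2:
        return None  # neutral: dropped in SST-2
    return 1 if sst5_label > 2 else 0
```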