Movies and TV

Source

The Internet Movie Database (IMDb) provides these data sets:

These data sets are refreshed on a daily basis.

Description of the Data

There are 7 data files provided by IMDb:

name.basics.tsv.gz
title.akas.tsv.gz
title.basics.tsv.gz
title.crew.tsv.gz
title.episode.tsv.gz
title.principals.tsv.gz
title.ratings.tsv.gz

The "Rotten Tomatoes movies and critic reviews dataset" is likely from here:

The Office dialogue likely is from here:

Transformations to the original data source

Kevin performed several transformation of the data.

He created comma-separated versions of the data from the IMDb tsv files:

akas.csv, crew.csv, episodes.csv, people.csv, ratings.csv, titles.csv

He also created a database file:

imdb.db

These two files are from the Rotten Tomatoes dataset:

rotten_tomatoes_movies.csv, rotten_tomatoes_reviews.csv

The dialogues from The Office are stored in:

the_office_dialogue.csv

Note that the source website mentions "55130 observations of 12 variables" but there seems to be 1 line in the csv file that stretches onto 3 lines (i.e., it has 2 extra lines). There is also the header line. Thus, the file the_office_dialogue.csv has a total of 55133 lines.

We can download the IMDb data using: