TDM 20100: Project 11 — 2022
Motivation: Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform "joins"! In this project we will explore learn about and practice using joins on our imdb database, as it has many tables where the benefit of joins is obvious.
Context: We’ve introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.
Scope: SQL, sqlite, joins
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/movies_and_tv/imdb.db
In addition, the following is an illustration of the database to help you understand the data.
For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.
To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.
%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db
For every following cell where you want to run a SQL query, prepend %%sql
to the top of the cell — just like we do for R or bash cells.
Questions
Question 1
In the previous project, we provided you with a query to get the number of MCU movies that premiered in each year.
Now that we are learning about joins, we have the ability to make much more interesting queries!
Use the provided list of title_id
values to get a list of the MCU movie primary_title
values, premiered
values, and rating (from the provided list of MCU movies).
Which movie had the highest rating? Modify your query to return only the 5 highest and 5 lowest rated movies (again, from the MCU list).
('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286')
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Run the following query.
%%sql
SELECT * FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286');
Pay close attention to the movies in the output. You will notice there are movies presented in this query that are (likely) not in the query results you got for question (1).
Write a query that returns the primary_title
of those movies not shown in the result of question (1) but that are shown in the result of the query above. You can use the query in question (1) as a subquery to answer this.
Can you notice a pattern to said movies?
-
Code used to solve this problem.
-
Output from running the code.
Question 3
In the previous questions we explored what is actually the difference between an INNER JOIN, and a LEFT JOIN. It is likely you used an INNER JOIN/JOIN in your solution to question (1). As a result, the MCU movies that did not yet have a rating in IMDB are not shown in the output of question (1).
Modify your query from question (1) so that it returns a list of all MCU movies with their associated rating, regardless of whether or not the movie has a rating.
-
Code used to solve this problem.
-
Output from running the code.
Question 4
In the previous project, question (5) asked you to write a query that returns the average number of words in the primary_title
column, by year, and only for years where the average number of words in the primary_title
is less than 3.
Okay, great. What would be more interesting would be to see the average number of words in the primary_title
column for titles with a rating of 8.5 or higher. Write a query to do that. How many words on average does a title with 8.5 or higher rating have?
Write another query that does the same for titles with < 8.5 rating. Is the average title length notably different?
-
Code used to solve this problem.
-
Output from running the code.
Question 5
We have a fun database, and you’ve learned a new trick (joins). Use your newfound knowledge to write a query that uses joins to accomplish a task you couldn’t previously (easily) tackle, and answers a question you are interested in.
Explain what your query does, and talk about the results. Explain why you chose either a LEFT join or INNER join.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |