title: Analysing Netflix shows with pgAdmin and PostgreSQL published: true date: 2021-05-20 00:00:00 UTC tags: pgAdmin, postgreSQL, netflix, analytics
canonical_url: aiven.io/blog/analysing-netflix-shows-with-..
PostgreSQL is the perfect database for a huge variety of business cases, from IoT-type fast insertions to bulk analytics workloads. We even make it possible to explore exoplanet data!
Database interaction can be performed at various levels: command line tools are great for people glued to their keyboards... but some people prefer a slick point-and-click UI that shows all the information without having to write tons of commands. In this blog post, we'll look at how to use Aiven for PostgreSQL with pgAdmin, one of the most popular tools for administration and development.
Here's what this post will cover:
- Aiven PostgreSQL service creation, since you can only query a database if it exists
- How to install pgAdmin and connect to a PostgreSQL database
- How to load a Netflix dataset into PostgreSQL via pgAdmin
- How to check Samuel L. Jackson statistics via pgAdmin queries
1. Create a PostgreSQL database
If you don't have a PostgreSQL database ready, why not create one with Aiven's console? Here's how:
- In the Aiven Console, click + Create a new service
- As Product, select PostgreSQL
- Select a cloud provider, a region and a plan.
Once created, you'll be able to find the information about your database by double-clicking on the instance name. The Overview tab shows information like Database Name, Host and Port, together with the pre-created
avnadmin
user and password.
We just need to wait a couple of minutes until the Nodes lights become green, meaning our PostgreSQL service is ready to be used.
2. Install and Connect pgAdmin
I chose pgAdmin because it's one of the most popular choices amongst PostgreSQL client tools. It can run either as a web or desktop application, and for the aim of this blog post we'll choose the latter. You can download it from the pgAdmin website, and the installation just takes a couple of clicks.
Once installed and launched, the tool asks you to set a Master Password. This is a good way to secure all the credentials we're going to store.
Now it's time to connect to PostgreSQL. To do that, click Create New Server and fill in the required parameters. On the General tab, set the connection name; on the Connection tab, enter the hostname, port, and maintenance database as well as username and password used to authenticate. (All this information available in the Aiven Console on the Overview tab). Finally, on the SSL tab, select SSL Mode = required
; this is Aiven's default security method.
We should now be able to access our PostgreSQL default screen. It contains a set of visualisations around number of sessions, transactions per second, in and out tuples and server activity. Very useful information for monitoring our database.
pgAdmin offers a wide set of administration features, from local database backups management to Schema Diff, a valid tool for enabling a DDL comparison across databases.
3. Load Netflix data
pgAdmin is not only a monitoring and administration tool. Since it offers also development experience, we can use it to create any kind of objects in the database via GUI or pgsql. For example, you can create a table, upload in it a csv file and start querying it.
We'll use some data related to Neflix shows, taken from Kaggle. You only need a free account to download it. We can then unzip it and we'll use the file netflix_titles.csv
for the rest of the blog.
We have the dataset, what about the table structure? We can generate the table DDL with the handy ddlgenerator
tool as explained in a previous post all we have to do is to run the following from a terminal window
pip install ddl
ddlgenerator postgres ~/Downloads/netflix_titles.csv
The above command should provide the following output, which we can now copy and paste in pgAdmin query editor available by selecting Tools -> Query Tools and run it.
CREATE TABLE netflix_titles (
show_id VARCHAR(5) NOT NULL,
type VARCHAR(7) NOT NULL,
title VARCHAR(104) NOT NULL,
director VARCHAR(208),
_cast VARCHAR(771),
country VARCHAR(123),
date_added TIMESTAMP WITHOUT TIME ZONE,
release_year BIGINT NOT NULL,
rating VARCHAR(8),
duration VARCHAR(10) NOT NULL,
listed_in VARCHAR(79) NOT NULL,
description VARCHAR(248) NOT NULL
);
Browse the object tree in the pgAdmin left-side panel to Default-db -> Schemas -> Public -> Tables, and you should see a table named netflix_titles
.
To load the csv file into the netflix_titles
table, do the following:
- Right-click on the
netflix_titles
table and selectImport/Export
. - Set
Import
in the Import/Export slider. - Select the
netflix_titles.csv
file from the local computer - Enable the
Header
flag. - Click OK.
You can see your import thread being started and populating the netflix_titles
table. Once finished, pgAdmin shows the text Successfully completed
.
4. Have fun with the data!
Now that we loaded the data, we can have some fun querying it. Let's create a new query editor and check which were the top 3 countries based on movie production according to the dataset with the following query:
select country,
count(*) nr_movies
from netflix_titles
group by country
order by 2 desc
limit 3;
The result, unsurprisingly, puts United States on top, followed by India - and then Null, probably due to problems during the data collection.
Now it's time to explore a bit the filmography of one of my favourite actors: Samuel L. Jackson
. Which actor was in most movies with him?
select trim(s.actor) actor,
count(*) nr_movies
from netflix_titles nt,
unnest(string_to_array(nt._cast, ',')) s(actor)
where _cast like '%Samuel L. Jackson%'
and trim(s.actor) <> 'Samuel L. Jackson'
group by trim(s.actor)
order by 2 desc
limit 10;
The following query allows us to explode, with the unnest
function, the _cast
field containing a comma separated list of actors for each movie. The result tells us that Tim Roth and Walton Goggins are the two lucky actors who shoot three movies with Samuel L. Jackson.
Finally, what are the 10 most used words (with more than 3 characters) in movies titles?
select title_word, count(*) nr_movies
from netflix_titles nt,
unnest(string_to_array(upper(nt.title), ' ')) s(title_word)
where char_length(title_word) > 3
group by title_word
order by 2 desc
limit 10;
Again, the unnest
function over the title
field enables us to verify that, as expected, LOVE is the most frequently used word. But run this query yourself to find out which surprising word takes third place!
pgAdmin is a perfect fit for PostgreSQL
pgAdmin offers a great UI for PostgreSQL: monitoring the database, comparing changes across instances, managing local backups, query, import and export of data. The tool offers a series of functionalities which make developer and administrators life easier and daily tasks accessible. If you want more information here are few links:
- pgAdmin website where to find downloads, documentation and FAQs
- Aiven for PostgreSQL providing details about our offering
- Getting started with PostgreSQL containing useful information on how to create a service and interact with it in a variety of languages
- Aiven's console where you can create your favourite open source data platform