Analysing Netflix shows with pgAdmin and PostgreSQL

Analysing Netflix shows with pgAdmin and PostgreSQL


title: Analysing Netflix shows with pgAdmin and PostgreSQL published: true date: 2021-05-20 00:00:00 UTC tags: pgAdmin, postgreSQL, netflix, analytics

canonical_url: aiven.io/blog/analysing-netflix-shows-with-..

PG and chill

PostgreSQL is the perfect database for a huge variety of business cases, from IoT-type fast insertions to bulk analytics workloads. We even make it possible to explore exoplanet data!

Database interaction can be performed at various levels: command line tools are great for people glued to their keyboards... but some people prefer a slick point-and-click UI that shows all the information without having to write tons of commands. In this blog post, we'll look at how to use Aiven for PostgreSQL with pgAdmin, one of the most popular tools for administration and development.

Here's what this post will cover:

  1. Aiven PostgreSQL service creation, since you can only query a database if it exists
  2. How to install pgAdmin and connect to a PostgreSQL database
  3. How to load a Netflix dataset into PostgreSQL via pgAdmin
  4. How to check Samuel L. Jackson statistics via pgAdmin queries

1. Create a PostgreSQL database

If you don't have a PostgreSQL database ready, why not create one with Aiven's console? Here's how:

  • In the Aiven Console, click + Create a new service
  • As Product, select PostgreSQL
  • Select a cloud provider, a region and a plan. Once created, you'll be able to find the information about your database by double-clicking on the instance name. The Overview tab shows information like Database Name, Host and Port, together with the pre-created avnadmin user and password.

Overview Tab

We just need to wait a couple of minutes until the Nodes lights become green, meaning our PostgreSQL service is ready to be used.

2. Install and Connect pgAdmin

I chose pgAdmin because it's one of the most popular choices amongst PostgreSQL client tools. It can run either as a web or desktop application, and for the aim of this blog post we'll choose the latter. You can download it from the pgAdmin website, and the installation just takes a couple of clicks.

Once installed and launched, the tool asks you to set a Master Password. This is a good way to secure all the credentials we're going to store.

Now it's time to connect to PostgreSQL. To do that, click Create New Server and fill in the required parameters. On the General tab, set the connection name; on the Connection tab, enter the hostname, port, and maintenance database as well as username and password used to authenticate. (All this information available in the Aiven Console on the Overview tab). Finally, on the SSL tab, select SSL Mode = required; this is Aiven's default security method.

Connection Properties

We should now be able to access our PostgreSQL default screen. It contains a set of visualisations around number of sessions, transactions per second, in and out tuples and server activity. Very useful information for monitoring our database.

Default Screen

pgAdmin offers a wide set of administration features, from local database backups management to Schema Diff, a valid tool for enabling a DDL comparison across databases.

3. Load Netflix data

pgAdmin is not only a monitoring and administration tool. Since it offers also development experience, we can use it to create any kind of objects in the database via GUI or pgsql. For example, you can create a table, upload in it a csv file and start querying it.

We'll use some data related to Neflix shows, taken from Kaggle. You only need a free account to download it. We can then unzip it and we'll use the file netflix_titles.csv for the rest of the blog.

We have the dataset, what about the table structure? We can generate the table DDL with the handy ddlgenerator tool as explained in a previous post all we have to do is to run the following from a terminal window

pip install ddl
ddlgenerator postgres ~/Downloads/netflix_titles.csv

The above command should provide the following output, which we can now copy and paste in pgAdmin query editor available by selecting Tools -> Query Tools and run it.

CREATE TABLE netflix_titles (
    show_id VARCHAR(5) NOT NULL,
    type VARCHAR(7) NOT NULL,
    title VARCHAR(104) NOT NULL,
    director VARCHAR(208),
    _cast VARCHAR(771),
    country VARCHAR(123),
    date_added TIMESTAMP WITHOUT TIME ZONE,
    release_year BIGINT NOT NULL,
    rating VARCHAR(8),
    duration VARCHAR(10) NOT NULL,
    listed_in VARCHAR(79) NOT NULL,
    description VARCHAR(248) NOT NULL
);

Browse the object tree in the pgAdmin left-side panel to Default-db -> Schemas -> Public -> Tables, and you should see a table named netflix_titles.

To load the csv file into the netflix_titles table, do the following:

  1. Right-click on the netflix_titles table and select Import/Export.
  2. Set Import in the Import/Export slider.
  3. Select the netflix_titles.csv file from the local computer
  4. Enable the Header flag.
  5. Click OK.

Import options

You can see your import thread being started and populating the netflix_titles table. Once finished, pgAdmin shows the text Successfully completed.

4. Have fun with the data!

Now that we loaded the data, we can have some fun querying it. Let's create a new query editor and check which were the top 3 countries based on movie production according to the dataset with the following query:

select country,
  count(*) nr_movies
from netflix_titles
group by country
order by 2 desc
limit 3;

The result, unsurprisingly, puts United States on top, followed by India - and then Null, probably due to problems during the data collection.

Now it's time to explore a bit the filmography of one of my favourite actors: Samuel L. Jackson. Which actor was in most movies with him?

select trim(s.actor) actor,
    count(*) nr_movies
from netflix_titles nt,
    unnest(string_to_array(nt._cast, ',')) s(actor)
where _cast like '%Samuel L. Jackson%'
    and trim(s.actor) <> 'Samuel L. Jackson'
group by trim(s.actor)
order by 2 desc
limit 10;

The following query allows us to explode, with the unnest function, the _cast field containing a comma separated list of actors for each movie. The result tells us that Tim Roth and Walton Goggins are the two lucky actors who shoot three movies with Samuel L. Jackson.

Finally, what are the 10 most used words (with more than 3 characters) in movies titles?

select title_word, count(*) nr_movies
from netflix_titles nt,
    unnest(string_to_array(upper(nt.title), ' ')) s(title_word)
where char_length(title_word) > 3
group by title_word
order by 2 desc
limit 10;

Again, the unnest function over the title field enables us to verify that, as expected, LOVE is the most frequently used word. But run this query yourself to find out which surprising word takes third place!

pgAdmin is a perfect fit for PostgreSQL

pgAdmin offers a great UI for PostgreSQL: monitoring the database, comparing changes across instances, managing local backups, query, import and export of data. The tool offers a series of functionalities which make developer and administrators life easier and daily tasks accessible. If you want more information here are few links: