Solving the knapsack problem in PostgreSQL

Solving the knapsack problem in PostgreSQL

ยท

9 min read


title: Solving the knapsack problem in PostgreSQL published: true date: 2021-08-31 00:00:00 UTC tags: postgres, packing, analytics

canonical_url: aiven.io/blog/solving-the-knapsack-problem-..

The knapsack problem: how to use limited luggage space to pack only and all the items you're most likely to need? Enter PostgreSQL! Come and find out how to use the world's best open source database to help you pack.

Elephant on a Suitcase

The days before a holiday are exciting. You've planned a vacation somewhere and can't wait to hit the road. But first, there's a whole list of decisions to make: how many underpants should I bring? Are the running shoes necessary? What about the hairdryer? Which jacket should I pack? Lots of decisions to take with the suitcase size as constraint and the best holiday experience as ultimate goal.

The aim of the blog post is to help solve the packing problem using technology. Specifically, PostgreSQL and the recursive Common Table Expressions.

Combinatorial optimization - the knapsack problem

People have had to pack luggage for a long time, so this optimization dilemma is far from new. It even has a name: the knapsack problem. It can be applied to a variety of use cases where there is a set of items with a defined weight (space occupied in the luggage in our example) and value (item benefits during holiday) and where the total weight is limited (luggage size).

The end goal is to come up with a set of items that fits within the weight constraint and has the maximum possible value.

For example, let's assume we have the following items:

ItemValueWeight
๐Ÿงฆ Socks103
๐ŸŽฉ Hat155
๐Ÿ‘– Trousers515
๐Ÿ‘Ÿ Shoes810
๐Ÿ‘• T-Shirt710

And we have a knapsack with a capacity of 20.

This means we could pack:

  • ๐Ÿ‘– Trousers (weight 15) + ๐ŸŽฉ Hat (weight 5) for a total value of (5 + 15) = 20 OR
  • ๐Ÿ‘• T-Shirt (weight 10) + ๐ŸŽฉ Hat (weight 5) + ๐Ÿงฆ Socks (weight 3) for a total value of (7 + 15 + 10) = 32

The second option gives us a much better overall value based on the settings in the table.

(At this point, we should note that this concept and solution is presented as an oversimplified mathematical problem. Aiven is not responsible if you decide to leave for your holidays without any shoes or trousers!)

PostgreSQL to the rescue

How can we be sure to make the best choice? Can technology help? Of course! Just pull out your trusty PostgreSQL (which you should always remember to pack).

Let's start by creating an instance with the help of Aiven's command line interface.

avn service create demo-pg      \
    --service-type pg           \
    --cloud google-europe-west3 \
    --plan hobbyist

The above creates a PostgreSQL instance (--service-type pg) called demo-pg in the region google-europe-west3 with an hobbyist plan. You can size and move your instance as you wish; check out all the combinations available.

Once the instance is up and running (we can use avn service wait demo-pg to wait for it), we can log into the database from the prompt with:

avn service cli demo-pg

Great, we're in PostgreSQL! Now we can create the new inventory table and fill it with our dataset.

create table inventory (item_id serial, item_name varchar, value int, weight int);
insert into inventory(item_name, value, weight) values ('Socks', 10,3);
insert into inventory(item_name, value, weight) values ('Hat', 15,5);
insert into inventory(item_name, value, weight) values ('Trousers', 5,15);
insert into inventory(item_name, value, weight) values ('Shoes', 8,10);
insert into inventory(item_name, value, weight) values ('T-Shirt', 7,10);

Let's verify the data:

select * from inventory;

And we get the expected output:

 item_id | item_name | value | weight
---------+-----------+-------+--------
       1 | Socks     |    10 |      3
       2 | Hat       |    15 |      5
       3 | Trousers  |     5 |     15
       4 | Shoes     |     8 |     10
       5 | T-Shirt   |     7 |     10
(5 rows)

We're now ready to crack our luggage knapsack problem.

Iterative approach and PostgreSQL recursive CTEs

There are multiple ways of solving the knapsack problem. One of them is to replicate in PostgreSQL what you would do in a trial & error case: you pick one item, add it to the luggage, then pick another item and add it, going on like that until the suitcase is full. Once no more items can fit in, you sum all the values of the inserted items and take note of the list. Then you empty the luggage and start from scratch again, trying to make sure you don't end up with the same sequence of items just in a different order.

When all the possible item combinations have been checked, you can then select the one with maximum value. It's a long and error prone process, but luckily, computers can handle it for us.

The above process defines the steps we need to perform to find all the combinations:

  1. Select one random item from inventory.
  2. Add it to the picked_items list and store the item cost
  3. Select another item from inventory, making sure that: a. it wasn't already in the picked_items list and b. the weight of the current item plus the total_weight of all the items in the picked_items list is less or equal to the maximum allowed weight
  4. Add the new item in the picked_items list
  5. Add the item weight to the total_weight
  6. Go back to step #3 until inventory contains new items satisfying conditions a and b.

We can spot an iterative pattern here of sequential queries against the inventory table. How can we map it into PostgreSQL? We want to perform select statements against the inventory table for a variable number of iterations (the list of picked_items can have different sizes). We need a way to recursively call the select statement: here is were PostgreSQL recursive CTEs shine, since they allow the creation of queries that can reference their own output.

For example we can get all the combination of 3 items starting with Socks in our inventory table with:

WITH RECURSIVE items(picked_items, nr_items) as (
    SELECT ARRAY[item_name] as picked_items,
        1 nr_items
    from inventory
    where item_name = 'Socks'
    UNION ALL
    select picked_items || item_name,
        nr_items + 1
    from inventory cross join items
    where nr_items+1 <= 3
)
select * from items where nr_items=3;

The RECURSIVE keyword sets the scene. The second SELECT statement is what does the trick by joining the inventory table with the result of the RECURSIVE items query we are still defining. The result lists all the 25 combinations of 3 items in the inventory table starting with Socks.

            name             | nr_items
-----------------------------+----------
 {Socks,Socks,Socks}         |        3
 {Socks,Socks,Hat}           |        3
 {Socks,Socks,Trousers}      |        3
 {Socks,Socks,Shoes}         |        3
 {Socks,Socks,T-Shirt}       |        3
 {Socks,Hat,Socks}           |        3
 ...
 {Socks,T-Shirt,Hat}         |        3
 {Socks,T-Shirt,Trousers}    |        3
 {Socks,T-Shirt,Shoes}       |        3
 {Socks,T-Shirt,T-Shirt}     |        3
(25 rows)

Add the knapsack constraints

The above query returns all the combinations available in the inventory table limited to 3 items. However, our knapsack problems has different constraints:

  • Once an item is picked, we can't reuse it
  • The sequence doesn't have a fixed length but it's driven by the total_cost

Let's check how we can add the above constraints in the recursive query, keeping 20 as total allowed cost:

WITH RECURSIVE items(item_id, picked_items, nr_items, total_weight, total_value) as (
    SELECT
        item_id,
        ARRAY[item_name] as picked_items,
        1 nr_items,
        weight total_weight,
        value total_value
    from inventory
    UNION ALL
    select
        inventory.item_id,
        picked_items || item_name,
        nr_items + 1,
        weight + total_weight,
        value + total_value
    from inventory cross join items
    where
        picked_items::varchar[] @> ARRAY[item_name] = false
        and weight + total_weight <= 20
        and inventory.item_id > items.item_id
)
select * from items order by total_value;

Let's cut the query down in sections: the first piece is the seed of our recursive query, we select all rows from the inventory table

SELECT
    item_id,
    ARRAY[item_name] as picked_items,
    1 nr_items,
    weight total_weight,
    value total_value
from inventory

We store the item_id as single value and we add it to the picked_items array. We also store the total number of items (1), the associated weight as total_weight and value as total_value. The next phase, after the UNION ALL, is the recursive bit:

select
    inventory.item_id,
    picked_items || item_name,
    nr_items + 1,
    weight + total_weight,
    value + total_value
from inventory cross join items
where
    picked_items::varchar[] @> ARRAY[item_name] = false
    and weight + total_weight <= 20
    and inventory.item_id > items.item_id

We select the item_id, store it a single value and add it to the picked_items array. Add +1 to count of items and add the weight and value of the current item to total_weight and total_value of the previously selected items respectively.

The interesting bit comes in the following section. We perform a CROSS JOIN between inventory and items (our recursive query) creating a cartesian product of the tables. In other words, we then have a row for each item that exists in inventory multiplied for each row in items. Not very clever, but the filtering does the magic:

  • picked_items::varchar[] @> ARRAY[item_name] = false: makes sure that we exclude from the result set any rows for which the item_name is already in the list of picked_items
  • weight + total_weight <= 20: discards from the result set all items for which the related weight summed to the previously calculated total_weight exceeds the 20 threshold we set for our luggage.
  • inventory.item_id > items.item_id: it's an optimization filter; {Trousers,Shoes} and {Shoes,Trousers} are two combinations of the same items, and we want to keep only one. Thus we allow the selection of items only in ascending item_id order, this will help improving performances since we are not analysing the same subset of items multiple times.

The result of the full query above is:

 item_id |       picked_items        | nr_items | total_weight | total_value
---------+---------------------------+----------+--------------+-------------
       3 | {Trousers}                |        1 |           15 |           5
       5 | {T-Shirt}                 |        1 |           10 |           7
       4 | {Shoes}                   |        1 |           10 |           8
       1 | {Socks}                   |        1 |            3 |          10
       2 | {Hat}                     |        1 |            5 |          15
       5 | {Shoes,T-Shirt}           |        2 |           20 |          15
       3 | {Socks,Trousers}          |        2 |           18 |          15
       5 | {Socks,T-Shirt}           |        2 |           13 |          17
       4 | {Socks,Shoes}             |        2 |           13 |          18
       3 | {Hat,Trousers}            |        2 |           20 |          20
       5 | {Hat,T-Shirt}             |        2 |           15 |          22
       4 | {Hat,Shoes}               |        2 |           15 |          23
       2 | {Socks,Hat}               |        2 |            8 |          25
       5 | {Socks,Hat,T-Shirt}       |        3 |           18 |          32
       4 | {Socks,Hat,Shoes}         |        3 |           18 |          33
(15 rows)

It shows all the combinations of items that fit in our 20 capacity luggage. When we sort it by total_value, it becomes evident that the combination {Socks,Hat,Shoes} is the one that provides the maximum value.

Again, please take this as pure theoretical exercise and not as holiday packing suggestion. What? do you have also other constrains to add? Taking this as a basis, feel free to add any additional complexity that your real case requires!

Some thoughts on knapsack problem and PostgreSQL recursive CTEs

The knapsack problem is very common: from project planning to resource allocation there is a variety of real cases where we need to fit as much items as possible in some sort of limited bucket. PostgreSQL recursive CTEs represent a valid option where a query iterative approach is needed.

A word of caution here: performance can quickly go wild if the recursion is not properly limited by filters. Define your recursion, set your boundaries, and enjoy how PostgreSQL can help you packing before the holidays!

Some more references:

ย