
The Art of Celebrating Too Early And Backtracking to Save The Day. A Scraping Story

How we doubled our data with a simple backtracking trick...

One of our clients asked us if we could build a simple app for scanning product barcodes in stores. The goal was to instantly tell the user whether a product is safe for consumption given their specified intolerances (e.g. gluten). We would get the product information from a partner website. Only, there was a tiny issue: there was no public API for the data. On the bright side, we were allowed to scrape it 🕷.

// Our first scraper

Building a scraper may seem like a daunting task. There are many tools available online, but after a thorough assessment we decided we could simplify things by using Elixir (a functional programming language) and the Phoenix Framework. With Phoenix, we bootstrapped a basic admin panel in less than a day. The scraper itself is based on a tiny library called Crawly for making the web requests, and on Flow for parallelising the data processing. The product pages did not list barcodes, but they did have high-quality images we could scan with ZBar. Decoding high-quality images is slow - all the more reason to multithread. Long story short, our scraper turned out to be pretty slick 🚀.

// Time to make the client happy

We ran our scraper several times and made as many improvements as we could. Our database was brimming with thousands upon thousands of products, ready to be picked by our soon-to-be loyal customers. We were pretty confident we had nailed it. Our client was pleased to find out we had managed to pull the products without any public API, and took the app (built with React Native) for testing with friends and family. We went to celebrate with 🍻 and 🎮.

A couple of days later our client scheduled a meeting. Time for laurels, we thought. But no, the app turned out to be completely useless. Our client had scanned his entire fridge and found just two products that we also had in our database. Same went for the others. How was that possible? We did scrape the entire website, didn't we? Were there products that were not listed on the website?

// Needle in a haystack

Scraping is indeed a daunting task. After a day of uselessly testing and validating our code, we turned our heads towards the website. What if we looked for the products in the fridge? Namely, the ones we missed. Searching for them manually actually worked! They were on the website, only our scraper couldn't get to them.

Now the fun begins.

Our first discovery was that the website had a hard limit of 201 products per listing: no matter which category we chose to scrape, we would always get back the same 201 products for it and nothing beyond them. We had scraped by category and dietary preference, but we then discovered that we could get completely different products by applying different combinations of dietary preferences. This meant that if we generated all possible combinations of dietary preferences, we would end up with a lot more products, if not all of them. But how do we solve that?
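To get a feel for the cost, some quick arithmetic: with n dietary preferences there are 2^n - 1 non-empty combinations, so the number of crawl passes grows exponentially. A quick sanity check in Elixir (the preference counts here are illustrative, not the site's actual numbers):

```elixir
# Number of non-empty combinations of n dietary preferences: 2^n - 1.
count = fn n -> round(:math.pow(2, n)) - 1 end

count.(3)   # => 7 combinations for three preferences
count.(10)  # => 1023 crawl passes for ten preferences
```

Each pass is still capped at 201 products, but different combinations surface different products, so the union keeps growing.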

// Back to high school

First, we defined our problem. Suppose we have a list of numbers, where each number represents a dietary preference. We need to come up with a function that finds all possible unique non-empty combinations of those elements. For list = [1, 2, 3], our function should return the following seven (2³ - 1) possibilities:

[1]
[1, 2]
[1, 2, 3]
[1, 3]
[2]
[2, 3]
[3]

Having simplified the terms of the problem, we were relieved. Relieved because we knew what to do next, but also because all those years spent in college learning about algorithms were yet again useful (:feeling-grateful)!

Like children at play, we each took on the challenge of solving the problem individually, then chose the best solution. Time to code-kata!

Of course, none of us looked for anything fancy, because the scraper would only run once to get the data. An optimized brute force solution should be enough.

Here are some of the raw solutions:

# "To pad, or not to pad?", this is the question...

def kata1(input_list) do
  length = Enum.count(input_list)
  # limit = (2^length) - 1
  limit = (:math.pow(2, length) - 1) |> round()

  Enum.reduce(1..limit, [], fn c, acc ->
    # The binary digits of the counter select which elements are included.
    digits = Integer.digits(c, 2)
    # Left-pad with zeros so every bitmask has exactly one bit per element.
    to_pad = length - Enum.count(digits)

    combination =
      (List.duplicate(0, to_pad) ++ digits)
      |> Enum.zip(input_list)
      |> Enum.filter(fn {bit, _element} -> bit == 1 end)
      |> Enum.map(fn {_bit, element} -> element end)

    acc ++ [combination]
  end)
end
# Using the wisdom of bits.

def kata2(input_list) do
  length = Enum.count(input_list)

  # (2^length) - 1 -> limit
  {limit, _} =
    ""
    |> String.pad_leading(length, "1")
    |> Integer.parse(2)

  Enum.reduce(0..(limit - 1), [], fn c, acc ->
    # -(~~~c) == c + 1, formatted as a zero-padded binary string of `length` bits.
    combination =
      :io_lib.format("~#{length}.2.0B", [-Bitwise.~~~(c)])
      |> List.to_string()
      |> String.split("", trim: true)
      |> Enum.reverse()
      |> Enum.zip(input_list)
      |> Enum.filter(fn {bit, _element} -> bit == "1" end)
      |> Enum.map(fn {_bit, element} -> element end)

    acc ++ [combination]
  end)
end
# This one thinks it's all a game of heads and tails.

defmodule Combinations do
  def combine([_ | _] = list), do: combine(list, [])
  def combine([], _), do: []

  def combine([head | tail], current) do
    # At each element, branch: include it (heads) or skip it (tails).
    next = current ++ [head]
    [next] ++ combine(tail, next) ++ combine(tail, current)
  end
end

We ended up choosing the last solution presented above, which is a classic backtracking implementation. You don't implement backtracking algorithms every day. So, although you (should) learn to use it in high school, chances are you forget about it by the time you get your first programming job. Our advice is to do competitive programming whenever you find the time or simply solve problems on platforms such as HackerRank and Codewars. It's fun and it keeps you fresh. Internally, we have a dedicated Slack channel where we add challenges and discuss solutions.
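Wired into the scraper, each combination becomes one crawl pass over the category listing. Here is a minimal self-contained sketch of that wiring (the `diet=` query format, the preference slugs, and the `partner.example` domain are made up for illustration; the backtracking function is repeated so the sketch runs on its own):

```elixir
defmodule FilterUrls do
  # Same backtracking shape as the chosen solution above.
  def combine([_ | _] = list), do: combine(list, [])
  def combine([], _), do: []

  def combine([head | tail], current) do
    next = current ++ [head]
    [next] ++ combine(tail, next) ++ combine(tail, current)
  end

  # One listing URL per non-empty combination of preference slugs.
  # The query-string shape is hypothetical, not the partner site's real format.
  def all_urls(base_url, preferences) do
    preferences
    |> combine()
    |> Enum.map(fn combo -> base_url <> "?diet=" <> Enum.join(combo, ",") end)
  end
end

FilterUrls.all_urls("https://partner.example/products", ["gluten-free", "vegan"])
# => ["https://partner.example/products?diet=gluten-free",
#     "https://partner.example/products?diet=gluten-free,vegan",
#     "https://partner.example/products?diet=vegan"]
```

Each generated URL is then handed to the crawler as a separate start point, and the results are de-duplicated in the database.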

// To be continued...

After integrating the backtracking solution into our existing code, it felt like our scraper was on steroids. The time to scrape the entire website doubled to around four hours. Gritting our teeth, we regularly monitored the numbers in the database. In the end, it turned out well - the number of products almost doubled!

You may be curious about what happened with the project. Well, with the new data, our client was happy to rescan his fridge and ask friends to do the same in real grocery stores. So far, the feedback has been relatively positive. As a proof of concept, it does enough to start a discussion about opening the much-needed public APIs that would make it work in real life.

For us, however, this was a short project stuffed with lots of fun challenges that required unusual, if not unorthodox, solutions.

// Constantin Cheptea // APR 16 2021

