Public Web Data Collection is a Constant Cat and Mouse Game

Data collection experts David Cohen and Paul Morgan share the challenges of their daily work

Behind the constant and reliable flow of web data that so many businesses depend on, there's always a team of tireless developers who ensure everything runs smoothly. That "smoothness" is not always easy to achieve, say David Cohen, Director of Engineering, and Paul Morgan, Technical Team Lead of the Data Collection team at Datasembly, an industry-leading provider of real-time, hyper-local pricing, promotion, and assortment data for CPGs and retailers. According to them, every day in the data collection world is full of unexpected challenges and situations that require quick and focused attention.

Daily challenges of web scraping will be at the forefront of this year's OxyCon conference. Ahead of the event, Paul and David share their views on the industry.

How did you get introduced to data collection and web scraping? What were the hardest and easiest aspects of the transition from app and website development to the creation of data collection architecture?

David: My first experience with web scraping was a personal project to collect data from a popular rock climbing logbook site. Essentially, I wanted to collect all of the data to open source it for analysis, with the eventual goal of building a demographic data-based climbing route recommendation engine. Some articles based on analyses of the data were published in climbing magazines, which led to an interesting and contentious interaction with the owners of the logbook site.

Paul: I was not at all familiar with web scraping before starting work with Datasembly. They saw the technical potential in me and the willingness and ability to learn, which gave them confidence that I would be able to learn the scraping ropes as I progressed.

For me, the easiest aspect of transitioning from app and web development to scraping data from apps and websites was the foundational knowledge of how they were typically built. Understanding where and how data is loaded proved exceptionally helpful when it came to dissecting and analyzing websites as opposed to building them.

The hard part about transitioning from app/web development to scraping is the lack of a clearly defined final goal. With scraping, the target is constantly moving: APIs are changing, apps are being updated, bot detection techniques are improving, page layouts are evolving. There's a constant battle between spending time making sure all current scrape jobs are running smoothly and finding the time to improve and harden the overall data collection architecture.

What were the greatest challenges when you started and how have they evolved since then? Are there any specificities that only exist in ecommerce websites?

David: In the four and a half years that I've been with Datasembly, the data collection landscape has seen some pretty serious evolution. Early on, the greatest challenges centered on fairly rudimentary JavaScript challenges, headless browser cloaking, and HTTP header management. Nowadays, we see much more sophisticated bot management techniques that combine elements of TLS fingerprinting, HTTP request construction, single-use cookies, and more.

While we do collect eCommerce data, our bread and butter is really in-store retail data (collected via web scraping). The biggest specificity in that sense is likely setting the location/store correctly programmatically. There are some pitfalls that are easy to fall into, like cookies that silently expire and cause the site to return data for the default location rather than the intended one.
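
To make that pitfall concrete, here is a minimal Python sketch of the kind of check that guards against a silently ignored location cookie. The endpoint, cookie name, and response fields are hypothetical placeholders rather than any specific retailer's API.

```python
# A minimal sketch of verifying that a location cookie was actually honored.
# The domain, cookie name, and response fields below are hypothetical.
import requests

def fetch_prices_for_store(store_id: str) -> dict:
    session = requests.Session()
    # Many sites key the active store off a cookie; if it expires or is
    # rejected, the site may silently fall back to a default location.
    session.cookies.set("store_id", store_id, domain="example-retailer.com")

    resp = session.get("https://www.example-retailer.com/api/products", timeout=30)
    resp.raise_for_status()
    payload = resp.json()

    # Check that the response is really scoped to the requested store,
    # instead of trusting that the cookie was accepted.
    returned_store = payload.get("storeId")
    if returned_store != store_id:
        raise ValueError(
            f"Expected data for store {store_id}, got {returned_store}; "
            "the location cookie was probably ignored or expired."
        )
    return payload
```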

Paul: Learning the nuances of bot detection and methods of circumventing blockages was a challenge when first starting. Over time, after overcoming many different types of blockages on a wide variety of websites and apps, you begin to develop a toolbox of techniques and things to look for that are very helpful.

One of the more challenging things we have noticed with eCommerce websites is pricing regionality. Some websites operate under the assumption that eCommerce prices are the same nationwide. On other websites, however, this doesn't hold true: products will be sold as "eCommerce" but have prices that differ from region to region.

What strategies do you use to ensure proper functionality throughout a constantly scaling and changing data collection endeavor? How do you allocate resources between deployment, scheduling, collection, and data delivery?

Paul: All of the strategies we use to ensure proper functionality in a constantly evolving data collection environment revolve around data analysis, logging, and monitoring. Every week (and even several times throughout the week), we compare currently collected data with historical data to catch any anomalies. In addition, we have developed systems that give us a heartbeat and other vitals of currently running jobs, so we can quickly detect when something isn't working as expected and address the issue within the day, instead of waiting until the scrape cycle is complete to notice the problem.
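
As a rough illustration of the current-versus-historical comparison Paul describes, the sketch below flags a scrape whose record count falls well below its recent average. The threshold and data shapes are illustrative assumptions, not Datasembly's actual monitoring stack.

```python
# A simplified anomaly check: compare the current scrape's record count
# against its recent history and flag large drops.
from statistics import mean

def check_record_counts(current_count: int, historical_counts: list[int],
                        max_drop: float = 0.2) -> None:
    """Raise if the record count falls more than max_drop below the recent average."""
    if not historical_counts:
        return  # nothing to compare against yet
    baseline = mean(historical_counts)
    if current_count < baseline * (1 - max_drop):
        raise RuntimeError(
            f"Anomaly: collected {current_count} records, "
            f"but the recent average is {baseline:.0f}."
        )

# Example: a job that usually yields ~10,000 records suddenly returns 6,000.
try:
    check_record_counts(6000, [10100, 9950, 10200])
except RuntimeError as err:
    print(err)
```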

On the data collection team at Datasembly, our developers wear many hats. In addition to working in many languages (Scala, Rust, Python, JavaScript, Go, to name a few), the team also manages the deployments, scheduling, and collection required to capture the data from a webpage before passing it off to our downstream engineering teams for delivery to a satisfied customer. Satisfying the customer is extremely important to us, which is evident in the 100% customer retention rate we've seen over the last few years.

What has been the strangest and most unusual issue you have encountered during your career in data collection?

David: Sometimes beating bot management engines feels like strange, dark magic. Beyond that, geofencing and similar location-based access restriction techniques are always fun. We have seen cases of mobile apps not displaying accurate in-store pricing unless the user is actually in the store. Triggering and observing the access restriction behavior is not so difficult when it is purely based on location, but when it starts to involve Wi-Fi network presence and other more obscure techniques, it can be a little more strange and difficult to observe.

Do you believe that data collection is becoming easier or harder? What changes have led to such a result?

David: In many ways, data collection is becoming more difficult. Bot management techniques have become much more sophisticated, more sites require high-reputation IPs for any sort of scalable traffic (thanks Oxylabs!), and so on. Entering the space as a newcomer without a strong toolkit would be much more difficult now.

However, there are other evolutions that have worked to our advantage. Our toolkit has advanced alongside those of the security providers and CDNs, so we have many more techniques at our disposal when we do run into issues.

Additionally, it is much rarer to see a website without JSON REST endpoints or JSON embedded in HTML these days. This often means that the data extraction layer of a scraper is a bit faster and simpler to write and more robust to style changes than in the old server-side rendering days.
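
As a simple illustration of why embedded JSON makes extraction more robust, the sketch below pulls a JSON payload out of a script tag and parses it directly, so cosmetic changes to the surrounding markup don't break the scraper. The tag id and payload shape are invented for the example.

```python
# Extracting JSON embedded in HTML instead of parsing styled markup.
# The script id and payload fields are hypothetical.
import json
import re

SAMPLE_HTML = """
<html><body>
  <div class="price-tile">...styled markup that changes often...</div>
  <script id="__PRODUCT_DATA__" type="application/json">
    {"sku": "12345", "price": 4.99, "currency": "USD"}
  </script>
</body></html>
"""

def extract_embedded_json(html: str) -> dict:
    match = re.search(
        r'<script id="__PRODUCT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("embedded JSON payload not found")
    return json.loads(match.group(1))

print(extract_embedded_json(SAMPLE_HTML))  # {'sku': '12345', 'price': 4.99, ...}
```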

Paul: Data collection is a constant cat-and-mouse game. It is becoming easier in many ways: the tools we use evolve, and more community knowledge is available. On the flip side, bot detection and security protocols are always becoming more advanced and sophisticated. Being on the data collection side of things, you always have the advantage, because these websites don't want to block legitimate customers from accessing their webpages. That means the goal of any good data collection system should be to mimic legitimate user traffic as closely as possible.
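
A minimal sketch of what "mimicking legitimate user traffic" can look like in practice is shown below: browser-like headers and small, jittered delays between requests. The header values and timing are illustrative assumptions; production systems layer many more techniques on top of this.

```python
# Sending browser-like headers and pacing requests with random delays,
# so traffic looks less like a tight machine-speed loop. Values are illustrative.
import random
import time
import requests

BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def polite_fetch(session: requests.Session, url: str) -> requests.Response:
    # A small, jittered pause between requests resembles a person browsing.
    time.sleep(random.uniform(1.0, 3.0))
    return session.get(url, headers=BROWSER_HEADERS, timeout=30)

session = requests.Session()
response = polite_fetch(session, "https://www.example.com/products")
```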

Paul Morgan will delve deeper into the topic of Data Collection: Orchestration, Observability, and Introspection at the upcoming web scraping conference, OxyCon. In his talk, he'll share some of the strange and challenging moments his team encounters in their daily work.

OxyCon is an annual conference organized by Oxylabs, a leading web scraping infrastructure provider. The two-day event will provide both strategic insights on public web data collection and easily applicable practical tips for anyone involved in the field. The event will take place online on the 7th and 8th of September, and registration is free.
