Internet data extraction, also known as web scraping, has long existed in a gray area, both legally and morally. While the process can be extremely useful for the user and harmless to the target websites, there’s no denying that scrapers and bots in general can be a huge nuisance too.
The subject of data extraction may seem uncertain to some, maybe shady, but it’s actually very straightforward. Let us show you:
Extracting data — legal or not?
The action of extracting data in itself is legal. After all, it’s almost the same as a normal user looking at that data and storing the information in their brain. It’s just more resistant to memory lapses when stored on a computer.
The crux of the problem is the type of data you gather and what you do with it afterward. We’ll go over both subjects and define what’s nice and what’s naughty.
When a person or company creates content, be it text, software, images, music, or anything else, they own it by default. That makes it copyrighted material, and it’s illegal for someone else to take it and use it for commercial purposes without permission.
If it’s posted publicly, as in, you’re allowed to access it, it’s ok to extract the information. What isn’t legal is to repost it under your name or brand, sell it to others, incorporate it into a product or service that makes you money, or other exploitative actions.
If you’re unsure whether the actions you want to take would violate copyright, check the laws in your country.
Another big subject is personal information, the type of data that has led to GDPR and a whole load of companies bugging their audience for consent to continue sending them emails. While GDPR only protects people in the EU, each country in the rest of the world has its own rules, some more outdated than others.
Personal information is anything that can be used to identify a person, so it covers things like:
- Email addresses;
- Phone numbers;
- Dates of birth;
- Credit card information.
As far as the GDPR is concerned, the general rule for extracting personal data is to not do it. There are exceptions, though: if the person gives their consent, which is rare and complicated to obtain in bulk, and if the scraper has a legitimate interest in the data, which is difficult to prove.
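If the pages you extract happen to contain personal details anyway, one cautious option is to strip them out before storing anything. The snippet below is only a rough sketch of that idea: the function name `redact_personal_data` and both regular expressions are our own illustrative assumptions, not a complete or legally sufficient filter. The patterns are deliberately simple, so they will miss some formats and may also catch other long digit sequences, such as dates.

```python
import re

# Simplified, illustrative patterns (assumptions, not an exhaustive PII detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches runs of at least nine digits/separators that start and end with a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_personal_data(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or +1 555-123-4567 today"
    print(redact_personal_data(sample))
```

Redacting at collection time means the sensitive values never reach your database, which is generally easier to defend than cleaning them up later.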
Some Terms of Service may apply
After squaring things away with the law, there’s the concern of what the search engines want. Service providers like Google, Bing, or Yandex all have their own Terms of Service that users agree to in order to use the engines.
This may be news for you since no one really reads those, but rest assured, if you have a Gmail account, you’ve explicitly accepted their ToS.
Anyway, like different countries, each business is free to define its own rules and guidelines. What you can definitely expect from each one is something along the lines of “don’t disrupt our business and don’t cause harm.”
For some companies, any sort of scraping may be viewed as a malicious action, even though you’d have to send a vast number of requests in quick succession to make a dent in a search engine’s processing power.
In other cases, the search engine developers may recognize the need to quickly extract and store data in JSON format, so they may even capitalize on the opportunity by releasing their own APIs and billing methods.
Just because something is against the Terms of Service doesn’t mean it’s against the law. It does mean that if a search engine explicitly states that it may block users who perform certain actions, you can expect to get blocked for performing them.
The definition of “harmful actions” can be rather vague, so it’s recommended that you use proxies whenever you’re extracting SERP data. You’re not dealing with the developers but rather with their scripts, which are designed to protect against malicious bots. Even if your extraction software is benign, those scripts can’t really stop to ask it, so it’s a bit of a wild west environment, where bots shoot first and ask questions never.
Come to think of it, that’s an apt description for the Internet in general. Data extraction is considered by many a legal gray area because it’s largely unaddressed in legislation and there’s no consensus between different states on a universal policy on web scraping. GDPR is the closest thing, and that only applies to the European Union.
How to get valuable data without stepping on any toes
Leaving regulations and search engine policies aside, it’s pretty clear that all anyone wants is to be treated nicely and fairly. You can do that while also getting the insightful data you want from SERPs.
Here are a few basic rules to follow so that you don’t accidentally cause harm:
- Read the website’s Terms of Service to learn their stance on data extraction.
- Do the same for the robots.txt file. It’s kind of the same thing as the ToS but made for the robots themselves to read.
- Only gather the data you actually need. A “measure twice, cut once” approach ensures that you don’t waste time on useless info while also sending as few requests as possible through the data extraction tool.
- Avoid collecting personal information. Unless you have a really good reason to gather it, stay away from information that is protected through GDPR or similar legislation.
- Be mindful of copyright laws. The rules are clear on intellectual property, and taking them seriously is key in not getting a surprise lawsuit.
- Extrapolate, don’t reuse. The power of SERP data is in the insights that you can gather from it and the superior strategies that you can craft. Searching for top-ranking content just to copy it is unneighbourly.
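The robots.txt rule above is easy to automate. Here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `my-bot` user agent, and the helper names `build_parser`, `is_allowed`, and `polite_delay` are all made up for illustration; the file is inlined so the example runs offline, whereas a real tool would load it from the target site with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/page", "https://example.com/private/data"):
        print(url, "allowed" if is_allowed(parser, "my-bot", url) else "disallowed")
    print("wait", polite_delay(parser, "my-bot"), "seconds between requests")
```

Checking permissions before each request and sleeping for the suggested delay keeps the number of requests low, which fits the “only gather the data you actually need” rule as well.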
This article has mainly covered whether you should extract data from SERPs and how to do it safely. There’s also the matter of how to actually go through with it, and that part takes a lot less work.
Here’s our advice: use the SearchData REST API! You even get 100 free searches to see for yourself how much a piece of software can change your strategy.