From remote corners of rural Guizhou province to the crowded streets of Beijing, the entrepreneurial capital, big data is buzzing everywhere in China. Industry insiders say it is to artificial intelligence what gas is to cars.
Yet as China’s big data industry has emerged and prospered, its over-exploitation of user data has stirred growing discontent among citizens.
In 2019, a series of scandals over the industry’s mishandling of user data came to light. In March, the “CCTV 315 Gala,” a nationally broadcast show produced by the state-owned TV station to expose illegal business practices in China, made big data one of its focuses. Later, the offices of Qiaoda Technology, a Chinese big data firm that claims to have processed the information of more than half of China’s population, were raided by authorities on suspicion of mishandling that information.
State scrutiny of big data companies came as no surprise to people in the industry, as similar incidents had been reported as early as two years before. In 2017, China’s Cyberspace Affairs Commission, the Ministry of Public Security and the Ministry of Industry and Information Technology called on Taobao and WeChat, two dominant internet products in China operated by Alibaba and Tencent respectively, to explain their privacy protection options to the public. The joint working group also gave the two companies “recommendations for privacy protection improvements,” which were essentially rectification orders.
In 2019, regulators summoned senior executives from China’s banks to question them on how their apps were collecting users’ personal information, while the Beijing Municipal Public Security Bureau targeted unauthorized web scraping in its recent operations against illegal internet activities.
For Qiaoda, it was scraping data from other websites that invited the police raid. A person whose company’s data was scraped by Qiaoda, and who did not want to be identified, told PingWest that Qiaoda had stolen huge numbers of online resumes from his company and that he had no option but to report Qiaoda to the authorities. A report by Beijing police said that “Qiaoda scraped a large amount of user information from many companies, using proxied IP addresses and faked device IDs to evade the victims’ server protections, and exploited the data for profit without users’ consent.”
A former Qiaoda employee told PingWest that a good portion of his co-workers had previously worked at the victim companies and were familiar with their web systems and with how to scrape data quietly and efficiently. “Crawling resumes is a common thing in our business,” he said.
There is currently no law or official regulation in China that specifically addresses web scraping. So when companies sue each other over scraping, the law most commonly invoked is China’s Anti-Unfair Competition Law, Article 12 of which forbids a company from using technical means to interfere with another company’s lawful business operations. In 2015, Sina Weibo, China’s largest social networking company, filed a lawsuit against Maimai, a social network for professionals, for scraping a large amount of Weibo’s data and refusing to delete it after an existing partnership between the two had expired. Setting a precedent, the court ruled in 2017 that the defendant had engaged in unfair competition.
According to the former employee, Qiaoda’s approach to obtaining data also included purchasing it on the online black market. “I took part in competitive bidding. In order to offer more data and prove our ability, we bought data from the black market prior to bidding,” he said.
The black market transaction flow can be roughly described this way: hackers obtain data by attacking websites and sell it to data agents, who may then sell it on to other agents or data companies. Eventually the data is sold to target companies that desperately need it. Prices often multiply along the way.
In fact, a lot of so-called “big data” companies are in the data agent business. According to Xinhua News, citing authorities in East China’s Shandong Province, a big data company named Datatang transferred a total of 4,000 GB of compressed data over eight months, containing more than 40 data points such as users’ names, genders, ages and phone numbers. The company transferred more than 130 million pieces of personal information per day. Selling this information is how Datatang generates income.
Most companies track users and collect data through a technique called event tracking, in which an event is a user’s specific interaction with an app or webpage, such as clicking on a certain button. Developers set up event tracking beforehand by adding tracking code to an element, so that users’ interactions with it are recorded and then transferred to servers for analysis.
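The mechanism described above can be illustrated with a minimal sketch in Python. It is not the code of any company mentioned in this article: the `Event` and `EventTracker` names, the field names and the JSON payload format are all assumptions made for illustration. The sketch records each interaction in a buffer and then serializes the buffer into the kind of payload that would be sent to a server.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    """One user interaction. All field names here are hypothetical."""
    user_id: str
    element: str   # the tracked element, e.g. "buy_button"
    action: str    # the interaction, e.g. "click" or "scroll"
    timestamp: float = field(default_factory=time.time)


class EventTracker:
    """Buffers interaction events and serializes them for upload."""

    def __init__(self):
        self.buffer = []

    def track(self, user_id, element, action):
        # Called by the tracking code attached to an element.
        self.buffer.append(Event(user_id, element, action))

    def flush(self):
        # Return the buffered events as a JSON payload and clear the buffer.
        # A real tracker would send this payload to a server for analysis.
        payload = json.dumps([asdict(e) for e in self.buffer])
        self.buffer.clear()
        return payload


tracker = EventTracker()
tracker.track("user-1", "feed", "scroll")
tracker.track("user-1", "buy_button", "click")
payload = tracker.flush()
print(payload)
```

Because every event carries a timestamp and the element it came from, sorting a user’s events by time is enough to replay the order of their interactions.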
Using this technique, it is possible to fully reproduce a user’s history of interactions with an app, almost mimicking a screen recording. “Which button you clicked. Which button next. How fast you scroll through the feed. Which page you spent the most time on. We are able to see all of that,” said the director of data at an online-to-offline (O2O) company based in Shanghai. “With an increasing need for more accurate recommendations to users, we must collect more data,” he said, adding that event trackers eventually get set up everywhere, because it is better to collect the data first than to have none when it is direly needed. For instance, Bytedance’s news aggregator Toutiao has been collecting the lists of apps users have installed on their phones, whereas many other big data companies refrain from doing so.
A number of companies told PingWest that when they collect sensitive data, it is masked most of the time. For example, a phone number like 123-456-7890 will be partially hidden and shown as 123-***-7890, so that big data companies cannot identify the user. But this is hardly foolproof, since many apps also collect, but do not mask, two other data points: the Media Access Control (MAC) address and the International Mobile Equipment Identity (IMEI). Both are unique identifiers and, once matched, can confirm a user’s real identity.
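A short Python sketch shows why masking alone falls short. Everything in it is a made-up assumption for illustration: the two datasets, the placeholder IMEI strings and the `mask_phone` and `link_by_imei` helpers do not come from any company described here. The first dataset holds a masked phone number next to an unmasked IMEI; the second, from a different source, pairs the same IMEI with a name.

```python
def mask_phone(phone: str) -> str:
    """Hide the middle group of a phone number formatted like 123-456-7890."""
    parts = phone.split("-")
    if len(parts) != 3:
        raise ValueError("expected a phone number like 123-456-7890")
    parts[1] = "*" * len(parts[1])
    return "-".join(parts)


def link_by_imei(masked_records, identified_records):
    """Join two datasets on the unmasked IMEI they have in common."""
    by_imei = {r["imei"]: r for r in identified_records}
    return [
        {**m, **by_imei[m["imei"]]}
        for m in masked_records
        if m["imei"] in by_imei
    ]


# Hypothetical data: the IMEI values are placeholders, not real identifiers.
behavior_data = [
    {"phone": mask_phone("123-456-7890"), "imei": "IMEI-0001", "pages_viewed": 42},
]
identity_data = [
    {"name": "Example User", "imei": "IMEI-0001"},
]

linked = link_by_imei(behavior_data, identity_data)
print(linked)
```

The phone number stays masked in the linked record, yet the behavior data is now attached to a name, because the device identifier served as the join key.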
Researchers at China’s Ministry of Public Security told PingWest that Chinese authorities are keeping a close eye on the EU’s General Data Protection Regulation (GDPR), widely regarded as the strictest data protection regulation in the world. However, in drafting their own rules, Chinese authorities aim to take a different path.
For example, GDPR mandates that users should manage their own data and that companies like Facebook and Google should provide them with the tools to do so, which decision makers among the Chinese authorities deem too radical. And while GDPR refers to the anonymization of user data, Chinese authorities chose to avoid that word, using “derived data” instead, which is essentially data processed in such a way that it becomes unidentifiable.
(This article is an abridged version of the original investigation by PingWest on China’s big data industry, written by Wang Zhaoyang. This article is written by Ran Yu. To learn more about big data in China, go to the original Chinese version.)