Small firms, big data
Many startups offer their clients big data analysis services based on machine-learning algorithms. The results of such analyses can be of interest to any company profiling its products or marketing campaigns. But for the analysis to be reliable, it takes data, and the more the better. Machine-learning algorithms must have something to learn from, and the accuracy of the forecasts subsequently developed for business purposes will depend on the scope of the training data fed to them. If the algorithm is limited from the start to an abridged sample of observations, the risk increases that it will group data incorrectly, overlooking important correlations or causal connections, or seeing them where they don't exist. Only training the algorithm on large datasets can minimise the risk of shortcomings in diagnosis and prognosis.
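This statistical point can be illustrated with a minimal sketch (synthetic data only; the variables and sample sizes are invented for illustration): two genuinely unrelated quantities can look related in a small sample, while a large sample reveals that no relationship exists.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def estimated_correlation(n_observations: int) -> float:
    """Correlation estimated from two independent random variables."""
    x = rng.normal(size=n_observations)
    y = rng.normal(size=n_observations)  # generated independently of x
    return float(np.corrcoef(x, y)[0, 1])

# A small sample may suggest a non-trivial "correlation" purely by chance...
print("n = 20:      r =", round(estimated_correlation(20), 3))
# ...while a large sample shows the true relationship is essentially zero.
print("n = 100 000: r =", round(estimated_correlation(100_000), 3))
```

The same logic applies to the more complex groupings a machine-learning algorithm builds: the smaller the training set, the more likely a pattern it finds is an artefact of the sample rather than a real relationship.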
Ensuring access to large datasets is not easy, however, in the run-up to the effective date of the most rigorous data protection provisions, the EU's General Data Protection Regulation. This is particularly problematic for startups, which, unlike bigger players such as online stores or social media sites, cannot generate big data themselves by observing a large base of users. Gathering publicly available data is not very helpful either: such data are largely macro data of limited utility for individualised evaluation of customers' problems, are often outdated, and, in the case of scientific data, were collected on the basis of more or less arbitrary selection criteria chosen by the author of the study, which limits their reliability.
Another option for small players is to obtain data from their own clients. Banks, healthcare institutions and other entities hiring startups to analyse data can entrust them with the data of their customers, under fairly relaxed formal conditions. But such entrusted data can be used only for purposes which the original customers consented to. The matter thus becomes complicated when the customers’ consent does not cover processing for the purpose of training algorithms. The GDPR does provide for exceptional situations where such data can be processed even without the customers’ consent, e.g. due to overriding legitimate grounds, but perfecting an algorithm might not qualify for that test, particularly when the processing is being done not by the service provider but by a startup it has commissioned to do the analysis. Great difficulties could also arise in specifying the scope of the consent. The notion of using data for the purpose of improving the process of machine learning might be disputed as insufficiently precise, unclear, or even unintelligible.
One solution could be anonymising the data. Anonymisation converts the data into a form permanently preventing them from being attributed to a specific person. It differs from pseudonymisation, where the data are encrypted in a way that allows the process to be reversed and the data to be reassigned to specific individuals. Pseudonymisation is a measure for protecting personal data; anonymisation is a method for excluding such protection altogether. Because anonymised data cannot be attributed to any individual, they cease to be personal data and the GDPR no longer applies to them (as confirmed by recital 26 of the regulation). Consequently, anonymisation can allow startups to make extensive use of big data supplied by their clients. But there are two essential issues to note.
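Before turning to those issues, the difference between the two techniques can be sketched roughly in code (the record and field names below are invented, and the sketch says nothing about what the GDPR requires technically): pseudonymisation replaces an identifier with a token that the key holder can link back to the person, while anonymisation removes or generalises identifying details outright.

```python
import hashlib
import hmac

# Invented customer record used only for illustration.
record = {"name": "Jan Kowalski", "postal_code": "00-950", "purchases": 7}

SECRET_KEY = b"key-held-separately-by-the-data-controller"

def pseudonymise(rec: dict) -> dict:
    """Replace the direct identifier with a keyed token.
    Anyone holding SECRET_KEY can recompute the token and link the
    record back to the individual, so the data remain personal data."""
    token = hmac.new(SECRET_KEY, rec["name"].encode(), hashlib.sha256).hexdigest()
    return {**rec, "name": token}

def anonymise(rec: dict) -> dict:
    """Drop the direct identifier and generalise the postal code so the
    record is no longer meant to be traceable to a specific person."""
    out = {k: v for k, v in rec.items() if k != "name"}
    out["postal_code"] = rec["postal_code"][:2] + "-xxx"  # keep only a coarse region
    return out

print(pseudonymise(record))
print(anonymise(record))
```

As the first issue below shows, such a naive approach may still fall short of true anonymisation, because the attributes left in the record can themselves point to a specific person.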
First, total anonymisation, permanently preventing data from being attributed to an individual, is harder to achieve than it might appear. Intuitive solutions, such as removing names from datasets, do not exclude re-identification (as may be seen, for example, in cases discussed this year in the Journal of Biomedical Informatics). A person's place of residence or age may still make it possible to identify them. A study in the US in 2006 showed that combining the postal code with the date of birth and the person's sex narrowed the field of "suspects" in nearly every instance to no more than five people out of America's 300 million citizens, and in 63% of cases was sufficient to zero in on the exact person!
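The mechanism behind findings of that kind can be sketched with a toy example (the records below are invented, not data from the 2006 study): group the records by the combination of postal code, birth date and sex, and see how many people share each combination. A group of one is effectively a named individual as soon as an attacker links it with an outside source such as a voter roll.

```python
from collections import Counter

# Toy "anonymised" records: names already removed, values invented.
records = [
    {"postal_code": "02139", "birth_date": "1965-07-12", "sex": "F"},
    {"postal_code": "02139", "birth_date": "1965-07-12", "sex": "F"},
    {"postal_code": "02139", "birth_date": "1971-03-02", "sex": "M"},
    {"postal_code": "10001", "birth_date": "1980-11-30", "sex": "F"},
]

# Count how many records share each quasi-identifier combination.
group_sizes = Counter(
    (r["postal_code"], r["birth_date"], r["sex"]) for r in records
)

for quasi_identifier, size in group_sizes.items():
    flag = "  <- unique, re-identifiable by linkage" if size == 1 else ""
    print(f"{quasi_identifier}: {size} record(s){flag}")
```

Techniques such as k-anonymity formalise this check by requiring every combination of quasi-identifiers to be shared by at least k records before a dataset is treated as sufficiently de-identified.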
Second, although anonymised data are no longer subject to the GDPR, the GDPR does apply to the process of anonymisation itself. The regulation covers processing of personal data, and processing is defined as any operation performed on personal data or sets of personal data, whether or not by automated means. Examples of processing given in the definition in the GDPR include adaptation, alteration, alignment, restriction, erasure or destruction of data. The anonymisation process itself boils down to operations such as adaptation and alteration. The argument that anonymisation qualifies as processing is supported by the fact that the concept of processing also includes other operations causing the entity performing them to cease to be covered by the GDPR, such as erasure or destruction of data.
This means that the provisions permitting processing of data only for the purposes covered by the data subjects' consent also apply to anonymisation. Consequently, a service provider hiring a startup to perform analysis, and willing to provide the startup with its customers' data for training the algorithms, must first obtain the customers' consent to having their data anonymised. However, phrasing such consent should be easier than phrasing consent to processing for the purpose of training algorithms. In this case it should not be necessary to specify the processing methods to which the customer consents once the data are anonymised, because at that point the data will no longer constitute the customer's personal data or data protected by the GDPR, and any such limitations could not be enforced.
Anonymisation may therefore make it easier for startups to use data obtained from clients that have a larger base of users. But this does not end the search for simpler ways for small businesses to gain access to big data.
Bartosz Troczyński