Understanding data collection strategy

By Guest Contributor Richard Benjamins

Author of A Data-Driven Company, Richard Benjamins, explains what a data collection strategy is and how to implement it in your organization.

Data is not a uniform asset: depending on the sector and organization, there are many different types of data, each with its own peculiarities. Data can be structured, unstructured or semi-structured; internal or external; traditional or digital. It is therefore important for an organization to have a data collection strategy that defines what data to collect, and when. Data collection has an associated cost and therefore needs to be part of the budgeting process.



We live in an era of abundant data. Studies show that the amount of data has grown exponentially, from 1–2 exabytes in 2000 to 40,000 exabytes in 2020 (Smith, 2019), and it will inevitably keep growing. When speaking about data, it's important to distinguish between first-party, second-party and third-party data (Paulina, 2017). In brief, first-party data (also referred to as internal data) is generated by an organization itself, based on its own operations and activity. Second- and third-party data refer to external data sources: an organization accesses second-party data through partnerships, under signed agreements, while third-party data is procured in the market and is usually based on a range of external data sources.

Most large organizations start their data and AI journey focusing on first-party data, and this is the main focus of this chapter. First-party data is directly related to the organization's business, and therefore has obvious value for those starting out on their data and AI journey. Some sectors, however, have richer first-party data than others; organizations in data-poorer sectors may need to rely on second- and third-party data. For instance, telecom, banking, and the tech giants (the so-called 'GAFA': Google, Amazon, Facebook and Apple) have rich first-party data, while the insurance sector traditionally lacks an important part of it. That's because, in that sector, sales and customer relationship data often sit with the brokers and agents who manage that part of the value chain. Insurance companies' customer data is often limited to bank account information, because that is what is needed to pay customer claims.

In summary, general data abundance does not mean that data collection is trivial, easy and free of charge. On the contrary, data collection is a complex process that needs to be managed explicitly, as a pivotal element in a data strategy.



Companies’ traditional data sources include CRM systems, billing data, transactional data, data from physical shops, and data related to customer evaluations, such as the Net Promoter Score (NPS). Collecting and preparing data from those types of sources for analysis and reuse is already a technical and organizational challenge.

As the world and businesses have become increasingly digital, new data sources have entered the picture. These include website data, data from apps, usage data relating to products and services, social media data, call centre content data, location data and others. Moreover, while traditional data was mostly structured, many of the digital data sources are un- or semi-structured, such as call centre content and social media data. Before relevant information can be extracted from such sources, AI technologies, such as NLP, speech recognition and image recognition, are often required.
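To make the extraction step above concrete, here is a deliberately simplified sketch. A real pipeline would use speech-to-text and trained NLP models; the topic names, keyword lists and sample transcript below are purely illustrative assumptions, not part of any production system.

```python
# Illustrative only: a toy keyword tagger standing in for real NLP.
# Production pipelines would combine speech recognition with trained models.

TOPIC_KEYWORDS = {
    "billing": {"invoice", "charge", "bill", "refund"},
    "churn_risk": {"cancel", "switch", "leaving", "competitor"},
}

def tag_transcript(text: str) -> list[str]:
    """Return the topics whose keywords appear in a call transcript."""
    words = set(text.lower().split())
    return sorted(topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws)

print(tag_transcript("I want to cancel because the last invoice charge was wrong"))
# → ['billing', 'churn_risk']
```

The point is not the technique but the outcome: unstructured content only becomes usable for analytics once some extraction step has turned it into structured fields.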



It's one thing to have the necessary access to data sources, but another to store that data properly, as we've seen in previous chapters. Data sources are usually scattered around the organization: on-premise, in the cloud, with third parties, and so on. An effective data collection strategy cannot do without a clear policy on where the accessed data should be stored. Storing it locally, close to the data source, would be a mistake, as that leads to data silos. Organizations that are serious about their data journey should have a storage strategy that ensures proper data management (provenance, quality, permissions, etc.) and efficient access for use cases.

Any organization can be viewed as a set of layers of data sources: there are the physical assets that generate data, there is the IT infrastructure, and there are the products and services. All of them generate data that needs to be captured in a coherent and manageable way. In Telefónica, these layers are referred to as the first platform (physical assets), second platform (IT infrastructure), third platform (products and services) and fourth platform (data platform). The data collection strategy states how each platform needs to provide its relevant data to the fourth platform.
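As an illustration only, the layering described above could be captured as a small catalogue in which each platform declares the data it feeds into the fourth (data) platform. The class, field names and example feeds are assumptions made for this sketch and do not reflect Telefónica's actual systems.

```python
from dataclasses import dataclass

@dataclass
class DataFeed:
    source_platform: str   # "first" (physical), "second" (IT), "third" (products)
    dataset: str
    refresh: str           # e.g. "daily", "streaming"
    contains_personal_data: bool

# Hypothetical feeds into the fourth (data) platform.
CATALOG = [
    DataFeed("first", "network_element_metrics", "streaming", False),
    DataFeed("second", "crm_customer_records", "daily", True),
    DataFeed("third", "app_usage_events", "streaming", True),
]

# Feeds with personal data need privacy controls before landing in the platform.
needs_review = [f.dataset for f in CATALOG if f.contains_personal_data]
print(needs_review)  # → ['crm_customer_records', 'app_usage_events']
```

Even a minimal catalogue like this makes the strategy auditable: every platform's contribution to the data platform is declared, rather than discovered ad hoc.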



It goes without saying that any data collection strategy needs to take privacy regulations for data protection into account whenever personal data is collected and stored. In my experience, large organizations have taken a huge leap forward in this respect (in Europe, at least, thanks to GDPR) and are taking privacy concerns seriously.



Finally, there is the question of open data (see Chapter 9), and whether this type of external data should be considered part of an organization's data collection strategy. Open data is published by an organization and can be reused by anyone for free, without limitations (Open Data Institute, 2017). The basic underlying idea is that much more value can be created if data is reused by many rather than closely guarded by the data holder. In the end, data is a 'non-rival' good: it is not depleted by use and can be reused repeatedly.

The use of open data by large organizations is far from mainstream, but it does happen. In many countries, openly publishing data is compulsory for public institutions, yet it remains almost non-existent among large corporations. The added value of open data for organizations is that it is free and can enrich internal data sources. When organizations want to use open data, it should be part of their data collection strategy. While open data may seem too good to be true, we saw in Chapter 9 that there are actually numerous pitfalls associated with using it for business-critical data activities.

Let’s recap:

First, since the data is external, there might be doubts about the quality and provenance, and this needs to be understood. Transparency is important here.

Second, most data becomes obsolete after some time and needs to be updated or refreshed. Open data is not always reliable regarding the stated frequency of updates.

Third, there's the liability problem. Open data is usually provided for free (by public institutions) because publishing it is compulsory. But the obligation concerns the publishing, not the quality or the updates. So, if an organization uses open data somewhere in its data value chain, and the open data part fails (out of date, not available, etc.), the organization has a problem. If it uses open data in paid services to customers, who is responsible (legally liable) for the service not being delivered: the organization, or the provider of the open data? In practice, it is not the open data publisher who is liable, so organizations need to be aware of that.
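A simple guardrail follows from these three pitfalls: check that an open dataset is present and fresh before it enters a business-critical pipeline. The sketch below assumes hypothetical metadata fields (`last_updated`, `update_frequency`) and arbitrary staleness thresholds; any real check would depend on the publisher's actual metadata.

```python
from datetime import date, timedelta

# Illustrative tolerances: how stale a dataset may be, per stated frequency.
MAX_AGE = {"daily": timedelta(days=2), "monthly": timedelta(days=45)}

def is_usable(meta: dict, today: date) -> bool:
    """Reject missing or stale open data instead of failing downstream."""
    if meta.get("last_updated") is None:
        return False  # dataset absent or never published: do not rely on it
    return (today - meta["last_updated"]) <= MAX_AGE[meta["update_frequency"]]

meta = {"last_updated": date(2024, 1, 1), "update_frequency": "monthly"}
print(is_usable(meta, date(2024, 3, 1)))  # → False: 60 days old, past the 45-day limit
```

Failing fast at ingestion keeps the liability question out of the paid service: the organization degrades gracefully instead of delivering a service built on stale data.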



It should now be clear why a coherent data strategy is necessary for organizations that want to create consistent, sustainable value from data. Data collection cannot be taken for granted, and not explicitly considering this in the data strategy will invariably lead to delays and frustrations when wanting to solve business problems with data and AI.

When designing a data collection strategy, organizations should consider the following:

• What data to collect, and when. As we have seen in this chapter, data should be collected based on use cases (business needs determine what data is needed), and data collection should be considered each time new products and services that will generate data are designed. Priorities and available resources will help shape the road map.

• Where and how to store the data. This relates to the earlier discussion around cloud versus on-premise (Chapter 11) and local or global storage, or a unified data model (Chapter 12). Is the data stored centrally, or locally in the geographies? Is it stored in the cloud or on-premise? And, is the data stored as is, or defined by the local business? Alternatively, is it stored in a unified data model? Those decisions are important, as they have an impact on timing and budget.

• Estimation of costs and budget assignment. As we have learned in this chapter, data collection is not a trivial process, contrary to popular belief. Therefore, without assigning budget for this activity, it's not likely to happen, and definitely not as planned.

• Efforts to break data silos. While organizations may not want to state this explicitly in a strategy document, the people factor is also important to consider. Even today, some large, traditional organizations consist of a set of uncoordinated, scarcely communicating silos. Breaking these silos, by requiring that their data be stored in a company-wide data platform, may meet with initial resistance. While this resistance makes no sense from an organizational point of view, it does from a human perspective: in the end, data is power, and people don't like to lose power. Ignoring this factor may again lead to delays and frustration.


RICHARD BENJAMINS is Chief AI & Data Strategist at Telefónica. He was named one of the 100 most influential people in data-driven business (DataIQ 100, 2018). He is also co-founder and Vice President of the Spanish Observatory for Ethical and Social Impacts of AI (OdiselA). He was Group Chief Data Officer at AXA, and before that spent a decade in big data and analytics executive positions at Telefónica. He is an expert to the European Parliament's AI Observatory (EPAIO), a frequent speaker at AI events, and strategic advisor to several start-ups. He was also a member of the European Commission's B2G data-sharing Expert Group and founder of Telefónica's Big Data for Social Good department. He holds a PhD in Cognitive Science, has published over 100 scientific articles, and is author of the (Spanish) book, The Myth of the Algorithm: Tales and Truths of Artificial Intelligence.

LinkedIn: https://www.linkedin.com/in/richard-benjamins/


Suggested Reading

Are you planning to start working with big data, analytics or AI, but don’t know where to start or what to expect? Have you started your data journey and are wondering how to get to the next level? Want to know how to fund your data journey, how to organize your data team, how to measure the results, how to scale? Don’t worry, you are not alone. Many organizations are struggling with the same questions.

This book discusses 21 key decisions that any organization faces when travelling its journey towards becoming a data-driven and AI company. It is surprising how much the challenges are similar across different sectors. This is a book for business leaders who must learn to adapt to the world of data and AI and reap its benefits. It is about how to progress on the digital transformation journey of which data is a key ingredient.

More information