EXCLUSIVE - Discussion on Big Data challenges: unstructured data, privacy, sharing, integration
Senior executives dealing with ICT from public sector, education and health care organisations in Singapore came together at the OpenGov Leadership Breakfast Dialogue, ‘The Big in Big Data – Managing the unmanageable’ in Singapore on the 10th of November. Two hours of interactive discussion yielded fascinating insights into a range of issues related to collection, storage, sharing and analysis of Big Data.
Christopher Aw (below, second from right), Regional Lead, Public Sector Programs, MarkLogic, opened the dialogue by talking about changes in ‘model’, ‘mentality’ and ‘mission’ in the public sector. Data models have evolved from hierarchical to relational, and now to an era where massive volumes of data, whether in military intelligence or patient healthcare records, are unstructured. Storing operational data in relational data models is losing its utility for making sense of the data and gaining insights from it. Mr. Aw cited a figure of 12% as the proportion of enterprise data held in highly structured databases as of 2014.
Mentality is shifting away from a system-centric approach, in which many different applications, each with a different back-end, are fed data from multiple sources. The current trend favours a data-centric approach, in which all the data is loaded and indexed from multiple ever-changing sources and processed in one place, and the output is delivered to the right user in the right format in real time.
In terms of ‘mission’, Mr. Aw said that IT is merging with operations. This is leading to requirements such as joint metadata catalogs that enable simultaneous search of disparate databases.
These trends necessitate a shift in the way data is dealt with, the manner of its collection and storage.
Henry Chao, Former Deputy CIO and Deputy Director, Office of Information Services, Centers for Medicare & Medicaid Services in the US, shared his experience leading the creation of the insurance marketplace as part of the implementation of the Patient Protection and Affordable Care Act (ACA), or Obamacare. The Act aimed to provide affordable coverage to an additional 20 million Americans.
Mr. Chao broke down the timeline from the signing of the ACA in March 2010 to the launch of healthcare.gov in October 2013. Such a vastly ambitious project presented a range of challenges regularly faced during the implementation of large-scale ICT projects in the public sector.
There was uncertainty in the scope, which was forever changing. Regulations had to be written that laid out how the programme would operate. For a long time, the team operated on a monthly stipend, making it difficult to award contracts. Connections had to be established with numerous federal and state-level agencies. The traditionally months-long process of insurance underwriting had to be reduced to a few seconds. Over 1,600 different insurance products had to be integrated and tested.
The information provided by applicants filling in the online forms had to be preserved while some of the applicants were shifted to state programmes that were more beneficial for them.
It was a completely new set of complex problems that had never been dealt with before. Mr. Chao highlighted the key question of when you are going to have enough information to build the critical pieces and have a minimum viable product. The team adopted its own brand of agile development, with parallel development of business processes and communication plans with stakeholders.
When a team has to absorb the shocks and volatility of ever-changing requirements, refactoring the relational model as many times as the application code makes the process significantly more challenging. During the last three months, there were 180 builds, an average of 2 a day. In that scenario, you don’t want to be encumbered by a changing relational model. A NoSQL database could help in tackling these issues.
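As a rough illustration of the point about changing requirements, the sketch below stores records as schemaless JSON documents, so that a field introduced in a later build does not force a table migration. It is a minimal sketch under stated assumptions, not a description of the healthcare.gov system: the field names, the in-memory `store` list and the `find` helper are all hypothetical.

```python
import json

# Hypothetical application records; all field names are invented for illustration.
early_build = {"applicant_id": 1, "state": "VA", "household_size": 3}

# A later build adds a nested "income" field. Nothing about the store has to change.
later_build = {
    "applicant_id": 2,
    "state": "TX",
    "household_size": 1,
    "income": {"amount": 28000, "currency": "USD"},
}

# A plain list of JSON strings stands in for a document store.
store = [json.dumps(doc) for doc in (early_build, later_build)]


def find(store, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    docs = (json.loads(raw) for raw in store)
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Queries tolerate documents that lack newer fields.
texas = find(store, state="TX")
all_docs = [json.loads(raw) for raw in store]
with_income = [d for d in all_docs if "income" in d]
```

In a relational design, the same change would typically mean altering the table and updating every query that depends on it, which is the overhead described above when builds arrive twice a day.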
Next, Klaus Felsche (above left), Former Director of Analytics from the Department of Immigration and Border Protection in Australia spoke about data needing to support decisions, actions and services. He presented a 3-step process of seeing, understanding and acting to enable evidence-based decision-making and solve business problems.
Governments collect huge amounts of extremely valuable data, which could potentially be used to improve services and consequently the lives of citizens. Mr. Felsche said, “We now need to construct systems where we don’t know all of the questions data can answer. We don’t even know some of the questions we need to answer.”
But a pre-requisite for this is the ability to collect and store information in a way that it is available for analysis. Mr. Felsche brought up what he called ‘invisible data’. That is data that cannot be used. It might be lying on a tape in a vault somewhere or in some inaccessible part of the network.
Questions and discussion
The first question posed to the delegates was, ‘What are some of the biggest data challenges in your organisation?’ Some 40% responded that it was the difficulty of accessing information. Increasing efforts to manage data and the challenges posed by manual aggregation of data garnered 25% and 15% of votes respectively.
Rupert Gwee (below right), Director, Human Resources Transformation Office, National Service Affairs Directorate, Human Resources Division, Ministry of Home Affairs, spoke about data not being organised and tagged properly, which makes it difficult to analyse. Forward screening is what is required. You have to come up with the concept and then articulate the requirements. In other words, it is about having a clear problem statement and figuring out what data sources will have to be grouped together. Otherwise people do not know how to organise the information, and it is not collected in a way that would be useful.
The issue of data silos and privacy also surfaced in the subsequent discussion. Vivien Chow (below), Director, Applied Innovation and Partnership, Government Technology Agency (GovTech), responded that integration of data from different sources is required, but the different agencies are at different stages of understanding how to anonymise the data. Sometimes a data set on its own might be adequately anonymised, yet it could be re-identified when combined with other data sets. The hurdle of effective anonymisation has to be surmounted before encouraging data sharing.
To tackle the issue, GovTech is working on a proof-of-concept (POC) for homomorphic encryption, which can enable analysis of encrypted data. If the POC is successful, it can be used across the government agencies.
Peter Tan Chin Seng (below right), Principal Architect, National Architecture Office, Integrated Health Information Systems (IHIS) Pte Ltd, said that even after anonymisation of identifiers, data can sometimes be re-identified, especially in boundary cases such as 99-year-olds with a specific condition.
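The boundary-case risk can be made concrete with a simple group-size check in the spirit of k-anonymity: count how many records share each combination of quasi-identifiers and flag combinations that occur fewer than k times. This is a minimal sketch with invented sample data; the `small_groups` helper and the field names are hypothetical and do not describe any tool used by the agencies mentioned.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Invented sample data: names are already removed, yet one record stands out.
records = [
    {"age": 45, "condition": "diabetes"},
    {"age": 45, "condition": "diabetes"},
    {"age": 45, "condition": "diabetes"},
    {"age": 99, "condition": "diabetes"},
]

risky = small_groups(records, ["age", "condition"], k=3)
print(risky)  # {(99, 'diabetes'): 1}
```

The single 99-year-old forms a group of one, so anyone who knows that a particular 99-year-old has the condition could pick out that record even though no direct identifier is present.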
Mr. Seng also talked about a problem with early technology adoption, as happened in Singapore’s hospitals. There is now a lot of data that is difficult to integrate, because data structures have changed and existing systems are based on the older ones. Efforts to change in order to harmonise data can face resistance.
Current technology is capable of redacting certain parts of the information, such as edge cases, on the fly, whereas previously separating out that one vector was difficult. The data can then be shared without compromising privacy. Encryption, anonymisation and redaction are the three keys to this. Earlier you would need to encrypt the entire database. Now it can be done based on certain criteria.
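A criteria-based redaction of this kind might look roughly like the sketch below, which masks selected fields only in records that match a rule and leaves the stored data untouched. It is an illustrative sketch, not any vendor's API: the `redact` helper, the `is_edge_case` rule and the age threshold are assumptions made for the example.

```python
def redact(record, should_redact, fields, mask="REDACTED"):
    """Return a copy of record, masking the given fields if the rule matches."""
    if not should_redact(record):
        return dict(record)
    return {key: (mask if key in fields else value) for key, value in record.items()}


def is_edge_case(record):
    """Hypothetical rule: treat very old patients as re-identifiable edge cases."""
    return record["age"] >= 90


# Invented sample data.
records = [
    {"age": 45, "condition": "diabetes"},
    {"age": 99, "condition": "diabetes"},
]

# Redaction happens at read time; the stored records are not modified.
shared = [redact(r, is_edge_case, {"age"}) for r in records]
```

Only the edge-case record loses its age in the shared view, so the bulk of the data stays usable for analysis while the record most at risk of re-identification is protected.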
The conversation veered to the presentation of data. Paul Gagnon, Director, E-Learning, IT Systems and Services, Nanyang Technological University - Lee Kong Chian School of Medicine said that his biggest challenge is to find the best possible way to display data to the front-end user, so that they can find information quickly. Different groups of stakeholders can have widely varying opinions on it and factoring in everyone’s needs or demands can prove to be a tough challenge.
Mr. Felsche said in response that putting out a viable product is important, as subsequent iterations can keep improving it. Trying to satisfy everyone can place government in a gridlock.
The next query to the delegates was about how they manage unstructured data (documents, attachments, pictures, sound, video etc). Here, a majority, 67%, replied that some of the unstructured data is included but it is not possible to include all.
Dr. John Kan, Chief Information Officer, Agency for Science, Technology and Research described the current approach of designing a content management system around the requirements, so that the essential data at least is classified and stored properly. Sometimes, you might need to choose what to manage.
Mr. Seng said that currently transactional systems are mostly relational. So, for dealing with data, metadata is being indexed into blocks. But this needs improvement.
The right metadata can be critical in managing data. Mr. Gwee provided another angle to this aspect:
Sometimes, there is over-analysis of how to use data. Basic analysis can suffice for most government needs. But moving to the next level requires a paradigm shift. He gave the example of the 300,000 to 360,000 people who cross the Johor–Singapore Causeway every day. The volume is huge, and unlike with airlines, the identity of the travellers is not verified in advance. Managing that flow while avoiding intrusive methods demands smart approaches. Sometimes simple ideas can be the smartest, and heavy crunching might not be required every time. One example is tracking army operations by keeping tabs on purchases of food items from provision shops by the soldiers’ wives.
When asked about the most important IT priorities, responses were split, with 40%, 30% and 20% for digital transformation and innovation, improving efficiencies and costs, and developing/deploying customer-facing applications respectively.
Mr. Mohamad Azman Jaffar, Deputy Director, Information Technology, Public Service Division - Prime Minister's Office, pointed out that transformation and innovation encompass the other options. Lim Soo Tong, Chief Information Officer, Jurong Health Services, concurred, saying that it is their mission.
Earlier, efficiencies and costs were the primary focus for ICT. That is no longer the case. Senior management now demands digital transformation to support business objectives.
Data can be used to make the business case here for transformational initiatives, from a holistic perspective. Possibilities of early intervention from predictive analytics and cross-pollination resulting in new viewing angles, through combinations and association of different data sets were also discussed. The dialogue moved to the critical role of interagency collaboration, for instance, between the Ministry of Health and Social services, to achieve these kinds of objectives.
Around 67% of attendees said that their mission critical data resides in multiple Relational Database Management Systems (RDBMS). In most Singapore public sector organisations, that data is already on enterprise document management systems and shared.
Dr. Kan said that it was important to understand the initial process that generated the data in the RDBMS, in order to know what was included and what was left out.
Concluding the dialogue, Mr. Aw talked about the process of continuous learning and improvement. There are still many gulfs to be traversed and potholes to be avoided. But there is no avoiding data. Data, in ever-increasing volume, velocity and variety, will continue to expand its role in how governments function, and governments have to evolve, adapt and progress.