OpenGov spoke to Mr. Steve Totman, Financial Services Industry Lead at Cloudera, about the use of big data by NGOs and charitable organisations to solve complex real-world problems and how Cloudera is involved in such initiatives.
A few years ago, MasterCard approached Cloudera to develop a PCI-Compliant Hadoop Environment [2]. In the credit card industry [1], it is essential to ensure that cardholder data is properly secured and protected and that merchants and third-party solution providers meet minimum privacy levels for any application, database, or file system that plays a role in storing, processing, or transmitting account-related data. The Payment Card Industry Data Security Standard (PCI DSS) was formalised as an industry-wide standard in 2004, originating as separate data security standards established by the five major credit card companies: MasterCard, VISA, Discover, American Express, and the Japan Credit Bureau.
NGOs are also dealing with incredibly sensitive data. Mr. Totman said, “Sure you get upset if you lose your credit card. But imagine what it is like when you are dealing with victims of domestic abuse or with homosexual victims in a country where they might be persecuted or even killed.”
It turns out that the tools and frameworks used by multinational banks and credit card companies for collecting, processing, and protecting data as well as finding wrongdoing, translate remarkably well to meeting the needs of NGOs dealing with some of the most vulnerable populations and the most dangerous criminals.
Going further, consider a large financial regulatory body that tracks stock transactions, looking for insider trades. It receives data from hundreds of brokerages: not just structured transaction data, but also audio files, e-mail communications, and text messages. All that data is placed into a massive store, so that the regulator can look for individual stock transactions with specific characteristics indicating collusion. That kind of logic is exactly what NGOs would want to use if they are looking for people doing bad things or searching for people who need help. For example, organisations looking at trafficking will look at sources such as Craigslist or discussion boards on the dark web, pulling in unstructured data in the form of video, audio, text, images etc.
Hadoop is well-equipped to handle structured data and it can blend structured and unstructured data together. In traditional relational databases, systems are structured around a data model, a schema. In Hadoop, you can store any form of data and flexibly apply the schema afterwards. This was one of the three points mentioned by Mr. Totman, explaining how Hadoop differs from legacy databases and how it is especially suited to deal with the requirements of large corporates such as banks and telecom companies as well as non-profits trying to use data to tackle challenging problems.
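The idea of storing data first and applying the schema afterwards can be illustrated with a minimal sketch in plain Python. It does not use Hadoop or any Cloudera API; the record layout, the `read_with_schema` helper, and the field names are all hypothetical and chosen only for illustration.

```python
import json

# Raw records are stored exactly as they arrive; nothing is validated on write.
# The records and their fields are made up for this example.
RAW_RECORDS = [
    '{"type": "transaction", "account": "A-17", "amount": "250.00"}',
    '{"type": "email", "sender": "x@example.com", "body": "call me"}',
    '{"type": "transaction", "account": "B-02", "amount": "99.50", "note": "late"}',
]


def read_with_schema(raw_lines, record_type, schema):
    """Apply a schema at read time.

    Keeps only records of the requested type and casts the fields named
    in `schema`; any other fields in the raw record are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        if record.get("type") != record_type:
            continue
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record
        }


# The schema is chosen by the reader, after the data has been stored.
transaction_schema = {"account": str, "amount": float}
transactions = list(read_with_schema(RAW_RECORDS, "transaction", transaction_schema))
print(transactions)
```

A different reader could apply a different schema to the same raw records, for example one that extracts only the e-mail bodies, without the stored data having to change.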
The other two differentiators are the significantly lower cost of dealing with large volumes of data (between 20 and 100 thousand dollars a year for 1 TB of data on traditional databases versus a couple of thousand dollars on Hadoop) and the flexibility of adding new data sources and analysing them within short time frames: a few hours, compared to a few months on earlier systems.
Mr. Totman walked us through a few examples of the kind of big data applications he had been talking about.
An Israel-based data analytics company, Treato, aggregates patient experiences from the Internet, organising them into usable insights for patients, physicians, and other healthcare professionals. It crawls the entire web for medicines, symptoms, side effects, and other health-related user generated content.
The volume of data is not the only challenge. Treato also needs to process colloquial language, such as that used in social media posts, combine it with medical terminology, and translate it into actionable insights. By 2013, Treato had aggregated and analysed more than 1.1 billion online posts about over 11,000 medications and over 13,000 conditions from thousands of English language websites. The Treato website currently claims to provide information on 14,748 symptoms and conditions and 26,616 medications and treatments.
In collaboration with Cloudera, Patterns and Predictions, a predictive analytics firm, developed an artificial intelligence (AI) solution that predicts mental health risk through opt-in analysis of social media and mobile text, with the goal of identifying indicators of suicidality, particularly among veterans, so that preventative action can be taken. The solution represented an extension of previous collaborations between the two organisations as part of The Durkheim Project, a DARPA-funded research program that ran from 2011 to 2015 and demonstrated the capability of big data technologies to effectively detect suicide risk at Internet scale.
Thorn is the organisation referred to in the title and mentioned a few times earlier in this article.
Thorn: Digital Defenders of Children is a non-profit dedicated to driving technology innovation to fight child sexual exploitation. Thorn partners with players from the technology industry, government, and non-governmental organisations, working to deter predatory behaviour, disrupt platforms that enable abuse, and accelerate victim identification.
Children are often bought and sold online, using online classified sites or escort pages (63% of child sex trafficking victims, according to the Thorn website). If technology was facilitating these heinous crimes, Thorn wanted to find the solution within technology to leverage the online information about these crimes to more rapidly find these children and connect them with victim services.
Thorn and Digital Reasoning, a provider of cognitive computing services to intelligence agencies and financial institutions, created Spotlight, a cloud-based collection and analysis tool that provides intelligence and leads on suspected human trafficking networks and individuals in order to identify and assist victims. Cloudera’s CDH platform provides the infrastructure, including the distributed processing needed to run state-of-the-art natural language processing and analytic algorithms on data that are harvested and organised in HDFS.
Spotlight has become the leading investigative tool for child sex trafficking investigations in the United States, with over 1,300 law enforcement users across 46 states.
Going the extra mile
Mr. Totman explained that it is not that difficult for the charities to get software or consulting services at little or no cost. But they also need skilled people who know how to use the resources and how to deal with data correctly.
Through Cloudera Cares, employees are encouraged to donate time and resources for these initiatives. The company’s customers have also expressed interest in getting involved and are searching for mechanisms to do so. Typically, they will throw money at the problem, but they can also provide data scientists. Cloudera is attempting to facilitate this borrowing of talent.
For instance, Cloudera recently collaborated with Intel and the National Centre for Missing and Exploited Children (NCMEC) on a month-long virtual hackathon to focus on innovative ways to locate missing children. They also organised a hackathon last year to explore new ways of using data to fight and prevent the Zika virus. These events provide opportunities for Cloudera and its partners to contribute to the use of “data for good.”
At the recent Strata Hadoop World San Jose event, Mr. Totman moderated a panel discussion on “Big Data as a force for good” to discuss using data for good and addressing the unique challenges humanitarian organisations and not-for-profits face in the big data world. The panelists included NetHope, a non-profit organisation working with over 20 international development organisations to identify key ICT-related needs related to the Syrian refugee crisis. Its efforts have included providing Wi-Fi hotspots and charging stations in camps and along the migration route. As Mr. Totman explained, the first thing the refugees need when they get off the boats is food and water. The next most important thing is connectivity. They need it to inform their families that they have made it there. Sometimes, it becomes essential for their safety and survival, as during a period in late 2015 and early 2016 when applications for asylum in Greece could only be submitted through Skype.
To give someone a Wi-Fi connection, you end up storing the MAC (media access control) address of the phone, which entails some basic information about the person. The General Data Protection Regulation (GDPR) of the EU includes the right to be forgotten, essentially meaning the right to be deleted. For NetHope to be able to delete that information, it would also need to store additional information, so that users who later ask for their information to be deleted can prove that it belongs to them.
This adds to the burden of protecting the information and of preventing antagonists engaged in the Syrian conflict from infiltrating and crippling the network and exposing both refugees and humanitarian aid workers to outside risks. There is also the risk of a private entity or a government executing hacks that support its own interests. Strong cybersecurity and privacy protocols had to be integrated into the network.
Mr. Totman said that data wants to be shared. But he pointed out that there are concerns regarding storage, protection, and ownership. There are strict legal and ethical implications around that.
Cloud platforms offer a range of interesting options. But it also matters where the cloud platform has a local data centre. Anonymisation and tokenisation both play an important role. Anonymisation turns data into a form from which information about individuals cannot be recovered. Tokenisation transforms the data in such a way that the original can be recovered under certain legal circumstances.
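The difference between the two approaches can be sketched roughly as follows. This is an illustrative example only, not how Cloudera or MasterCard implement either technique: the `TokenVault` class, the `anonymise` helper, the `authorised` flag, and the sample record are all hypothetical, and a real system would keep the vault in hardened, access-controlled storage.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens back to values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenise(self, value):
        # Reuse the same token for a repeated value so records stay joinable.
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenise(self, token, authorised):
        # Recovery is possible, but only when the caller is authorised
        # (standing in for "under certain legal circumstances").
        if not authorised:
            raise PermissionError("detokenisation requires authorisation")
        return self._token_to_value[token]


def anonymise(record, fields):
    """Irreversibly drop the named fields; nothing is kept for recovery."""
    return {key: value for key, value in record.items() if key not in fields}


# Made-up sample record; the card number is not a real one.
record = {"name": "Jane Doe", "card": "0000-0000-0000-0000", "city": "Athens"}

vault = TokenVault()
anonymous = anonymise(record, {"name", "card"})
tokenised = {**record, "card": vault.tokenise(record["card"])}

print(anonymous)   # the name and card number are gone for good
print(tokenised)   # the card number is replaced by a random token
```

Dropping fields is only the simplest form of anonymisation; the fields that remain can still identify someone when combined.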
There are questions around what to anonymise and what to tokenise. With anonymisation, both the frequency of the data and the relationships between fields have to be taken into account. For instance, you might choose to anonymise an uncommon surname, but once the user is in a country where their first name is rare, the first name alone could be enough for identification. Cloudera went through these kinds of issues with MasterCard.
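One common way to reason about the frequency problem is to count how many records share each combination of the fields left in the clear, in the spirit of k-anonymity. The article does not name this technique, so the sketch below is an assumption about how such a check might look; the `rare_combinations` helper, the threshold `k`, and the sample records are hypothetical.

```python
from collections import Counter


def rare_combinations(records, fields, k):
    """Return the field-value combinations shared by fewer than k records."""
    counts = Counter(tuple(record[field] for field in fields) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up records in which the surname has already been removed.
records = [
    {"first_name": "Maria", "country": "GR"},
    {"first_name": "Maria", "country": "GR"},
    {"first_name": "Oluwaseun", "country": "GR"},
]

# Any combination that appears fewer than k times may single out one person.
risky = rare_combinations(records, ["first_name", "country"], k=2)
print(risky)
```

Here the first name that appears only once in its country is flagged, even though the surname was never stored.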
Cloudera strengthened its encryption capabilities with the acquisition of Gazzang in 2014 and later pushed the encryption itself into the chipset, working directly with Intel. Today, hackers are very sophisticated and organised, sharing data, information on vulnerabilities, and hacking tools. But companies have not been coordinating in the same fashion. To bridge this gap, Intel and Cloudera initiated an open source project called Apache Spot.
Cloudera has developed a data governance solution called Navigator, which enables monitoring access to sensitive assets and seamlessly enforcing policies across the enterprise. The data lineage or provenance can be traced through Navigator.
Ultimately, data governance is a combination of people, processes, and technology. There are frameworks like privacy-by-design which help. But there are no simple answers.
Data can be a force for good in the world, helping chip away at apparently intractable problems. But it’s not enough to have data to solve a problem. You must show how it was collected, how it was stored and used, and it has to be protected all the way through. Security, lineage, and governance – they matter to banks and to charities. The transfer and sharing of tools, talent, and knowledge would help in unlocking the true potential of data, while dealing with the tricky concerns.
[1] Steve Totman is Cloudera's Industry Leader in Financial Services, Data Management Tooling and Ethical Data Governance, helping companies monetize their Big Data assets using Cloudera’s Enterprise Data Hub. Steve works with over 100 customers worldwide and helps several verticals in building architectures through data management tools and data models. Prior to Cloudera, Steve ran strategy for a Mainframe to Hadoop company and drove product strategy at IBM for DataStage and Information Server after the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents in data integration and governance/metadata related designs.
[2] Cloudera is the largest provider of Apache Hadoop based software, support and services. Apache Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.