Hadoop Founder Doug Cutting on his Journey, Addressing the Growing Skills Gap, and Smart Cities grappling with Data
During Strata + Hadoop World, OpenGov sat down with Mr. Doug Cutting, Founder of Hadoop and current Chief Architect at Cloudera, to discuss how he never imagined what Hadoop would produce, some of the most heartwarming examples of data analytics use, and what we must do to combat the growing skills gap in the industry.
With this exclusive opportunity to sit down with the man behind Hadoop, who is as humble as he is intelligent, we had a lot of questions to ask.
How did you start this journey?
I got in by chance as much as anything. I needed a job, liked programming, and landed a job looking into search problems. I knew some people who hired me to invent some things, and I developed an understanding of how to cope with large amounts of data. Then I got involved in open source.
When Google published some papers about the way they were doing things internally, managing their data systems, I had the experience to see that they were really better methods. I had been working on building search engines myself and I knew this was a big step up from what I was doing. I had enough experience with open source to recognize that if these methods were available as open source they would probably be widely adopted.
So I put two and two together and started implementing these as open source, which became Hadoop a year later. There wasn’t really a grand design; I just happened to be the right guy with the right information at the right time.
What did you imagine Hadoop would become?
I never imagined Hadoop would be what it is now. I had grown up in a world where enterprise software was very different from what researchers and websites used. It was a different universe of software development and a different style of building systems. Enterprises were based on relational databases running on big iron, mainframes, whereas researchers tended to have PCs and workstations.
I do not think I ever contemplated that the two worlds might merge. I thought that I could build some technology that would have a great impact on the research and web sphere, but that it would not likely leave that sphere. Now we have seen that enterprises are really adopting open source and adopting Unix, to a much higher degree than I ever would have guessed.
In retrospect, I guess this is not so surprising. If you look at Moore’s Law, technology is pervading every industry. Data is permitting institutions to better understand themselves, their users, and their context, and to improve. Now we really see that data is driving growth in just about every industry. I am very pleased to see that something I worked on is enjoying so much use, but this was not my original plan.
There is a growing skills gap in this industry. How would you propose to address this issue?
The adoption of the technology can’t grow any faster than there are people to use it. We are seeing that as a limiting factor for growth. We also see that institutions have a lot of other reasons for not adopting new technology. Institutions may evolve slowly, and adopting this platform requires a lot of change in many cases, especially cultural change.
So far, all of those things are paced together: the rate at which new people are learning, and the rate at which culture and institutions are changing to adopt these new technologies. In some ways, we do not want it to be so fast that we fall on our faces. Having some moderate pace is great.
It is important that more people get trained on this. Cloudera has a program to work with universities. We provide a curriculum so that they can teach students, who then come out of college familiar with these techniques. We are working with over 100 universities worldwide and are eager to add more to this program.
These days, people are starting to learn about these new technologies anyway. It is the technology that people are becoming familiar with, so to some degree it is generational. Some people will learn new technologies in the course of their careers and some people won’t. But the next generation will have those skills.
I do not think this is a fatal problem. I think public institutions will find people, although it may be more difficult. I don’t think it is a problem unique to these new technologies, but we are working as best we can and have offered training from the very beginning, as a strategy to help the technology spread. Cloudera has helped over 40,000 people so far in using these tools.
How will the public sector get more organisations on board with open data initiatives?
I think it is really important that organisations have buy-in from the top down to tell them that data is really valuable and can really help them improve.
They need to start taking advantage of it, thinking about it, planning around it, and thinking about their data policies. What are the ethics for appropriate use of data, for private and public organisations? How can they make people trust them?
Do you think that this top-down approach is best?
It is essential that you have that; it is necessary but not sufficient. You also need people who are familiar with these technologies and who understand them. I think it helps so much, coming from the top. Everyone I have met here in Singapore, and from neighboring countries, seems to understand this and take it very seriously.
Are these open data initiatives and policies integral to the success of Smart City programmes?
It is hard to say that, but it is certainly the smart way to operate. If you are trying to build a Smart City or Smart Nation, you want to take as many advantages as you can. When a government operates openly, it operates more efficiently and more effectively.
I think it is very important that governments open up all of their processes. It also permits more value to be extracted from the data when it is open, and it allows processes to improve with help from the private sector.
What would you advise organisations who hesitate to open up their data?
Security is a technical problem, I think. Security now goes hand in hand with Hadoop: there are facilities where you can keep your data encrypted at all times and control who can see what.
It can be harder with an open government initiative to decide issues about privacy. What can you publish? When you are operating openly, you are intentionally disabling a certain amount of security. You would like most data that you publish to be anonymous because you do not want to reveal private details of people’s lives. But oftentimes these things just leak out.
While you want to publish data, you want to protect identities. There are a variety of ways to do this: you can anonymise it, you can aggregate it and only provide information about groups rather than individuals, or you can have legal controls to prevent sharing with the public. In that last situation, the data would be provided to any person or institution that agrees to follow certain rules and to be audited to ensure they comply with these rules.
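As an editorial illustration of the aggregation approach described above, here is a minimal Python sketch, not something taken from the interview. The records, the `district` field, the helper `aggregate_counts`, and the threshold `MIN_GROUP_SIZE` are all hypothetical. The idea is to publish only per-group counts and to withhold any group small enough that its members might be singled out.

```python
from collections import Counter

# Hypothetical threshold: groups smaller than this are withheld from publication.
MIN_GROUP_SIZE = 5


def aggregate_counts(records, group_key, min_group_size=MIN_GROUP_SIZE):
    """Count records per group and drop groups below the threshold."""
    counts = Counter(record[group_key] for record in records)
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Made-up records for illustration only; names never appear in the output.
records = (
    [{"name": f"resident-{i}", "district": "North"} for i in range(12)]
    + [{"name": f"resident-{i}", "district": "South"} for i in range(12, 15)]
)

published = aggregate_counts(records, "district")
print(published)  # {'North': 12}
```

The "South" group has only three records, so it is left out. Choosing the threshold is a policy decision, and small-group suppression on its own does not guarantee anonymity.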
What is the most interesting story you have heard about people using this technology you created?
One project I was very impressed by was at a children’s hospital in Atlanta. They were gathering data from the neonatal ICU, with premature babies. What impressed me was not just that; they did not use this data for a big, ambitious project. Instead, they just gathered all of the data and asked the nurses what they would like to know.
The nurses had various questions about how quickly the babies’ vitals returned to normal after different procedures. Then they would try to modify the way they did these procedures, in order to have a less detrimental impact on these children.
It was a really neat study to see: the technology that I had worked on, helping at this children’s hospital. This was something I could see in person.
Caterpillar Tractor, on the other hand, has huge machines working all over the world. They are transmitting readings from hundreds of sensors, 60 times per second, back to Peoria. They can then analyse how these products are being used, detect when a machine might run into a problem, and do maintenance before the problem occurs.
When I started working in software, I never would have guessed that I would be working on software which would be used in either of these sorts of situations. It is very exciting to see these things.