OpenGov Asia had the opportunity to attend the very first Strata + Hadoop World Conference in Beijing in early August 2016 and took some time to speak to Doug Cutting, one of the speakers at the conference. Hadoop is an open source software framework for distributed storage and distributed processing of very large data sets on computers built from commodity hardware. We find out from the founder himself about his milestones in developing Hadoop and what is on the horizon.
Hadoop celebrates its 10th birthday this year. As the founder of Hadoop, what do you think are the most significant milestones for you?
I think there are a number of significant milestones. It really started before it was even called Hadoop, when we first founded the project. 10 years ago, in 2006, Yahoo! decided to invest in the technology – that was the first real significant milestone. Then we split those parts out into a new project named Hadoop; so that was the birth of Hadoop right there in 2006.
About a year later, we had something useable by a lot of people, something that really did scale and was reliable enough. It was a huge milestone in 2007 when we could run it on a thousand nodes and it would stay up for weeks at a time, and people could really process petabytes with open source software.
I think the subsequent milestones were really the arrival of more technologies on top of Hadoop. The initial ones were Pig, Hive and HBase, using MapReduce as the engine but providing other tools on top of it.
From that point onwards, it’s hard to pick another milestone. Spark has been a huge one in that it shows that we can really start replacing Hadoop with other technologies. I think that shows us that this ecosystem is something that can evolve over time and continue to improve in fundamental ways with significant new execution engines, new storage engines, and new schedulers. We’ve got another scheduler besides YARN in Hadoop, called Mesos. I think this competition is very healthy – it means that the ecosystem can continue to progress in its technical, political and economic architecture.
If you think about the relational database, it didn’t evolve very quickly and I think that was because it was controlled by companies that sold the databases and they would undercut their business if they embraced a different technology. They couldn’t afford to change the technology because that would change their business, creating an innovator’s dilemma. Once you have a successful business, it’s very hard to change because it will undermine your successful business. Fortunately, in open source, that doesn’t exist and Spark really demonstrated that first to me very clearly; that this platform can progress and companies like Cloudera are not left behind.
Cloudera was the first vendor to adopt Spark and today, we are still the largest supporter of Spark. People always ask, “Is Spark going to replace Hadoop?”, as though Spark is something we should be afraid of. Instead, Spark is something we very much embrace. It is beginning to replace Hadoop but not completely – it adds more capabilities, and we are also adding search capabilities with Solr, interactive SQL with Impala, and storage capabilities with Kudu. Through this, we keep seeing this ecosystem get more powerful and stronger. Thus, I think that Spark is a milestone that really demonstrates that the ecosystem has the capacity to grow and become stronger.
More recently, it’s hard to find a specific milestone but it would probably be the degree to which Hadoop and the Hadoop ecosystem are becoming the standard for data processing across industries. We see an incredible spectrum of customers from banks to telcos, retail, and even those who are making airplanes, cars, tractors and curing diseases.
This is something I never expected and I think even when we started Cloudera, we weren’t even sure that was going to happen. Now, it is really happening.
What can we expect from the Hadoop technology in the next 3-5 years’ time from your observations and what has been happening so far?
I think Hadoop is becoming more of a style of data processing and of building technology, where we have this loosely coupled ecosystem of projects, and where we load data before we apply a schema to it.
I think Hadoop was the first project to really set this style in motion, and now we’re seeing it as the most successful approach: a collection of open source projects that permits people to experiment with and explore their data.
The old style was what we call the waterfall technique, where you first do all of the design and once you are done with the design, then you begin the implementation. With Hadoop, we encourage people to experiment, and to build a very early prototype because they can afford to. We empower them with the tools and they can then evolve all the layers of the application as they see it working. This is a much more effective approach as they can change the application and improve it over time.
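The “load before you apply a schema” idea mentioned above is often called schema-on-read. The following is a minimal illustrative sketch in plain Python, not actual Hadoop or Hive code: raw machine-generated records are stored as-is, and a schema of our choosing is applied only when the data is read, so the same data can be queried in new ways as the application evolves.

```python
# Minimal sketch of "schema-on-read": raw records are stored without an
# upfront schema, and structure is imposed only at query time.
# Illustrative Python, not Hadoop/Hive code; field names are hypothetical.
import json

# Raw machine-generated records, stored as-is (e.g. web server logs).
raw_records = [
    '{"ts": "2016-08-01T10:00:00", "ip": "10.0.0.1", "bytes": 512}',
    '{"ts": "2016-08-01T10:00:05", "ip": "10.0.0.2", "bytes": 2048}',
    '{"ts": "2016-08-01T10:00:09", "ip": "10.0.0.1", "bytes": 128}',
]

def read_with_schema(records, fields):
    """Apply a schema (a list of field names) while reading, not while writing."""
    for line in records:
        doc = json.loads(line)
        yield tuple(doc.get(f) for f in fields)

# The same raw data can be read with different schemas as the application evolves.
rows = list(read_with_schema(raw_records, ["ip", "bytes"]))
total_bytes = sum(b for _, b in rows)
print(total_bytes)  # → 2688
```

Because no schema is baked in at write time, adding a new field to future records or projecting a different set of columns requires no migration of the stored data, which is what makes the early-prototype, evolve-as-you-go style practical.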
Share with us about your role as Chief Architect of Cloudera.
I do a number of different things – I help advise folks building the technology, so I talk to engineers and then the executives about where we’re going. I also work as an ambassador for the open source technologies, travelling as I am today and talking to people about how this open source ecosystem operates.
I also help out with the company strategy; how we can best use the open source methodology, what new technologies should we be building, essentially helping to start new initiatives and new investments in open source.
Cloudera, in many ways, is like an open source project. In open source projects, at least in Apache open source ones, you don’t have leaders, rather, you have contributors and Cloudera works that way as well. We do have some leaders, but technically, we have a lot of contributors and so I try to help out in all our processes.
We just saw the launch of BASE in China, which was also launched in Singapore and Malaysia earlier this year. What are your thoughts on BASE and why do you think governments and schools should come aboard to join the initiative?
A major driver of the new economy now is technology and in particular, data technology. If you want your economy to be successful, and your citizens to have better lives with good jobs (and the right jobs), then you want them to be involved in technology and have the right skills to stay on par.
Manufacturing is no longer about people holding tools, it is more about people programming robots to do automated and repetitive tasks, and this is happening in most industries. They are using more and more IT, and the skills that are most in-demand are software and technical skills.
I think governments recognise that, and so we, as a company, contribute to the ecosystem from the beginning with a model that trains people and then places them in the right jobs within the ecosystem. We adopted this model so that there will be people who understand the technology, which in turn helps to make our business more meaningful. This ends up benefiting industries and economies at large.
Therefore, we are happy to encourage and help countries build this kind of knowledge in their citizens that will help the country progress as a whole. The Big Analytics Skills Enablement (BASE) initiative is great because it involves not just the educational institutions, but also governments, as well as recruitment and placement organisations to form a holistic ecosystem, much like the open source model.
As an open source software evangelist, how do you think the transition from legacy systems to open source based ones for organisations can be made easier?
Our business is not focused on replacing existing systems. Existing systems tend to work well, but they tend to be based on data that was entered on keyboards – the traditional database workload. What we help people with is building new systems around machine-generated data. Originally, Hadoop was very popular for web-search engines and web advertising systems, analysing the logs from web servers. These are all machine-generated data and now, we’re seeing even more of this.
Automobiles have sensors on them, airplanes have sensors on them, even hospital beds have sensors; we’re seeing more of it as Internet of Things (IoT) spreads and it is a new kind of data that people don’t have the tools to store and analyse. But Hadoop lets them do that, allowing them to find a lot of value in their data.
Sometimes we do get involved with legacy data, or importing data from legacy systems. The legacy system will continue to run, but customers want to combine the data from that system with data from other, newer systems to fully understand their business and their customers. So we work a lot on import and export to legacy systems, and sometimes people will replace them because the costs of the legacy systems are becoming too expensive as their business grows. The cost of open source technologies is so much lower.
But most of our business is actually new applications in companies.
In terms of public sector or government-related projects which make use of big data and Hadoop, are there any examples that come to your mind?
Some of the earlier ones we have seen are tax agencies and government services, which use big data and Hadoop to prevent fraud – making sure that people are paying their taxes, and that the people who are collecting money from the government are not getting more than they should be.
Also, several government agencies are interested in applying big data analytics in healthcare, military and intelligence domains. They have massive volumes of data and are using technologies like Hadoop and Cloudera to help them manage and understand that data, securely.
There are several other applications for government use, and we’re starting to see more in managing utilities, traffic, and other government services. There are a lot of opportunities for government services to use data to better understand how systems are being used and how to improve them for the well-being of the community at large.