Time to rethink the who, what, where, why and how of big data.
It is probably time to rethink the who, what, where, why and how of big data. There has been a surge of important news in the past couple of weeks, but we are now approaching a period of relative calm and can finally assess how the space has evolved in the past year. Here are the top five trends shaping up that should change almost everything about big data in the near future, including how it’s done, who’s doing it and where it’s consumed.
The democratization of data science
The amount of effort being put into broadening the talent pool for data scientists might be the most important change of all in the world of data. In some cases, it’s new education platforms (e.g., Coursera and Udacity) teaching students fundamental skills in everything from basic statistics to natural language processing and machine learning. Elsewhere, it’s products such as 0xdata that aim to simplify and add scale to well-known statistical-analysis tools such as R, or products such as Quid that try to mask the finer points of concepts such as machine learning and artificial intelligence behind well-designed user interfaces and slick visual representations. Platforms such as Kaggle have opened the door to crowdsourcing answers to tough predictive-modeling problems.
Whatever the avenue, though, the end result is that individuals who have a little imagination, some basic computer science skills and a lot of business acumen can now do more with their data. A few steps down the ladder, companies such as Datahero, Infogram and Statwing are trying to make analytics accessible even to laypersons. Ultimately, all of this could result in a self-feeding cycle where more people start small, eventually work their way up to using and building advanced data-analysis products and techniques, and then equip the next generation of aspiring data scientists with the next generation of data applications.
Hadoop’s MapReduce reduction
Hadoop’s days as a platform solely for performing MapReduce jobs are officially over, and the change couldn’t have come fast enough. The evolution began with Apache Hadoop version 2.0 and its new YARN functionality that allows for new processing frameworks, but solidified with the spate of projects and products, including Cloudera’s very popular commercial distribution, that now include a SQL query engine or other method for interactive analysis running alongside MapReduce. That was a big item to check off the list of capabilities Hadoop must support, as data analysts need access to Hadoop data in a manner they understand.
From this point on — like with the Google MapReduce framework on which Hadoop’s version of MapReduce was modeled — it seems likely we’ll see the latter grow less important. Presumably, the Hadoop community will focus more on using the platform’s distributed nature to support real-time processing and other new capabilities that make Hadoop a better fit in next-generation data applications. If Hadoop can’t fill the void, there are plenty of people working on other technologies — Storm and Druid, for example — that will gladly do so.
The HBase NoSQL database that’s built atop the Hadoop Distributed File System is a good example of what’s possible when Hadoop is freed from the MapReduce constraints. Large web companies such as Facebook and eBay already use HBase to power transactional applications, and startups such as Drawn to Scale and Splice Machine have used HBase as the foundation for transactional SQL databases. More new products and projects, such as the graph-processing framework Giraph, will look for ways to leverage HDFS because it gives them a file system that’s scalable, free, relatively mature and, perhaps most importantly, tied into the ever-growing Hadoop ecosystem.
Coming soon to an app near you
Of course, all of this technological improvement is nothing without applications to take advantage of it, so it’s good news that we’re seeing a wide range of approaches for making this happen. One of these approaches is making big data accessible to developers, which is where startups such as Continuuity, Infochimps and even Precog (a big data BI engine, by nature) come into play. They make it relatively easy for developers to create applications that tie at least some functions into a big data backend, sometimes via a process as simple as writing a script or generating a piece of code that programmers can insert directly into their application’s code.
Another approach that’s picking up steam is simply to find a use case for big data (analyzing user behavior, network security, artificial intelligence, customer service) and turn it into a product or service that companies can buy and start using out of the box. These are things that early adopters such as Google, Facebook and others have had to build themselves but that others likely won’t have to. And everywhere you look, big data and data science are already being rolled into many web and mobile applications, from deciding which products to buy to figuring out your long-lost relatives. Somewhere, somehow, everyone surfing the web or using a mobile app is benefiting from big data.
Machine learning is everywhere
Machine learning has had something of a coming-out party in the past year and is now so prevalent it might be easy to mistake it for something that’s not difficult to do well. It’s easy to see why machine learning is so popular, though: In an age where consumers (and advertisers) want more personalization, and where computer systems are overwhelmed with data flying at them from all different directions, the prospect of writing models that continuously discover patterns among potentially countless data points has to be appealing.
Here’s a small sample of apps you’ve likely heard of, or that we’ve covered, that rely on machine learning to work their magic: Prismatic, Summly, Trifacta, CloudFlare, Twitter, Google, Facebook, Bidgely, Healthrageous, Predilytics, BloomReach, DataPop, Gravity. I could go on for days, I think.
Now, it’s difficult to imagine a new tech company launching that doesn’t at least consider using machine learning models to make its product or service more intelligent. Heck, even Microsoft appears to be making a big bet on machine learning as the foundation of a new revenue stream. The technology to store and process lots of data is out there, and the brainpower looks to be coming along as well. Soon, there will be few excuses for building applications that don’t learn as they go: for example, what users want to see, how systems fail or when customers are about to cancel a service.
Mobile data as the engine for AI
Long before Skynet takes over and the machines turn on humans, our mobile phones will know better than us what we want to do. That’s because until technologies like Google’s Project Glass actually make their way into the wild, our phones and the apps on them are probably the richest source of personal data around. And thanks to machine learning, speech recognition and other technologies, they’re able to make a lot of sense of what they’re given.
They know where we go, who our friends are, what’s on our calendars and what we look at online. Thanks to a new generation of applications such as Siri, Saga and Google Now trying to serve as personal assistants, our phones can understand what we say and know the businesses we frequent, the foods we eat, and the hours we’re at home, at work or out on the town. Already, their developers claim such apps can augment our limited vantage point by automatically telling us the best directions to our upcoming appointment, or the best place to get our favorite foods in a city the app knows we haven’t been to before.
The race is officially on to see who can build the smartest app, pull in the most data sources and figure out how to best display it all on a 4-inch screen.