Architecting Networks for the Big Data Lake
With so much data to analyze, how do you build a network architecture to access it all?
It should be common knowledge by now that Big Data should not only be big, but fast as well. This means network architecture is under the gun to produce greater connectivity and more streamlined traffic flows in the data lake than has ever been achieved in the data center.
Naturally, this has a lot of data scientists and architects worried. The concern is not that such functionality is impossible, since it already exists in HPC infrastructure, but that getting to that level will put an enormous strain on the IT budget. With the data center already tasked to do more with less, how much time and effort will the front office be willing to invest in Big Data, considering its real value to the business process is still largely theoretical?
According to Enterprise Storage Forum’s Christine Taylor, Hadoop-based Big Data deployments to date utilize direct-attached storage (DAS) architectures to keep data as close as possible to processing centers. This cuts down on latency by removing the complex networking required of centralized data stores, but it also makes it more difficult to share data between Hadoop clusters, particularly as the lake scales into the multi-petabyte range. HDFS is very adept at pulling data from various nodes, but extreme traffic levels call for highly specialized and intricate fabric architectures, such as Google’s Colossus – which, again, doesn’t come cheap.
This conflict between data locality and scale/efficiency is likely to remain one of the key challenges for Hadoop going forward, says Hedvig CEO Avinash Lakshman. As he explained to InfoStor recently, tools like MapReduce can effectively manage DAS architectures throughout the lake, but the result is islands of Big Data storage that work against the goal of quickly and easily comparing vast data sets against each other to gain insight. While it may be tempting to build nodes using hyperconverged infrastructure, this can result in heavy resource contention. A better bet is to run Hadoop on a dedicated application tier, put data on a dedicated storage tier, and then apply liberal doses of caching, tiering, dedupe and compression to minimize network latency. It also helps to use a parallel storage platform to avoid bottlenecks at single or dual controller points.
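To make the dedupe-and-compression point concrete, here is a minimal Python sketch of how a storage tier might shrink what it sends across the network. It is purely illustrative and not drawn from Hedvig or any Hadoop component: the function name plan_transfer, the choice of SHA-256 for fingerprinting and the choice of zlib for compression are all assumptions made for the example.

```python
import hashlib
import zlib


def plan_transfer(blocks, already_stored=()):
    """Decide what to send for each block (hypothetical helper).

    blocks: iterable of bytes objects to ship to the storage tier.
    already_stored: digests the receiving side is assumed to hold already.
    Returns (payloads, bytes_raw, bytes_sent), where each payload is a
    (digest, compressed_bytes_or_None) pair. None means "duplicate, send
    only the reference". bytes_sent counts compressed payload bytes only,
    not the digests themselves.
    """
    seen = set(already_stored)
    payloads = []
    bytes_raw = 0
    bytes_sent = 0
    for block in blocks:
        bytes_raw += len(block)
        digest = hashlib.sha256(block).hexdigest()
        if digest in seen:
            # Dedupe: the content is already known, so skip the body.
            payloads.append((digest, None))
            continue
        seen.add(digest)
        # Compression: shrink the body before it crosses the network.
        compressed = zlib.compress(block)
        payloads.append((digest, compressed))
        bytes_sent += len(compressed)
    return payloads, bytes_raw, bytes_sent


if __name__ == "__main__":
    sample = [b"a" * 1000, b"a" * 1000, b"b" * 1000]
    _, raw, sent = plan_transfer(sample)
    print(f"raw bytes: {raw}, bytes on the wire: {sent}")
```

The savings depend entirely on how repetitive and compressible the data is; highly redundant sample blocks like the ones above overstate what a real workload would see.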
In the very near future, however, Big Data will be able to take advantage of neural networking to solve many of these challenges, says enterprise architect Mitch De Felice. With its ability to classify data on its own, plus perform its own analytics and forecasting, the neural network opens up a wealth of possibilities for Big Data, including object and image recognition, language processing (no more typing or clicking of commands; you just speak them) and pattern recognition. Most importantly, these networks are able to learn as they process data, so they provide an extremely organic means of adapting themselves to particular workflows and system requirements to provide optimal results at ever-faster rates – in short, the more you use them, the better they function.
At the moment, however, enterprises are turning toward cloud-based Big Data service platforms to start generating analytics results without the time and expense of building infrastructure themselves. This pushes the network challenge to a new level because performance is gauged not by the speed of the internal network but by how quickly clients can get data in and out. BDaaS providers like BlueData are addressing this problem with container-based solutions that allow some workloads to be off-loaded from the lake without actually pulling any data. The company’s EPIC platform utilizes the Open vSwitch solution to create multi-host networks that provide users with isolated VXLAN connections. This allows traffic to and from Docker containers to be routed through separate hosts rather than a single controller, improving scalability and performance across the board. At the same time, each container’s local storage is enhanced with root and data directory information to minimize overhead and simplify management of both Hadoop and Spark services.
In true enterprise fashion, for every problem encountered in the data lake there is either a direct solution or a workaround. Most Big Data platforms will actually start pretty small, however, so it’s important to ensure that today’s solution does not lead to bigger problems as workloads start to scale.
But ultimately, all of this effort will only be worthwhile if the system produces actionable intelligence for the enterprise. And that will not happen as the result of advanced networking or design, but from human ingenuity.
Arthur Cole covers networking and the data center for Enterprise Networking Planet and IT Business Edge. He has served as editor of numerous publications covering everything from audio/video production and distribution, multimedia and the Internet to video gaming.