Drowning in data? Computer scientists at the National Institute of Standards and Technology (NIST) just released broad specifications to improve approaches for analyzing very large quantities of data by building more widely useful technical tools for the job. The framework supports the creation of tools that can be used in any computing environment.
Following a multiyear effort, NIST published the final version of the NIST Big Data Interoperability Framework, a collaboration between NIST and more than 800 experts from industry, academia and government. Filling nine volumes, the framework aims to guide developers on how to deploy software tools that can analyze data using any type of computing platform, from a single laptop to the most powerful cloud-based platform. Just as important, the framework advises analysts on how they can move their work from one platform to another and substitute a more advanced algorithm — without retooling the computing environment.
“We want to enable data scientists to do effective work using whatever platform they choose or have available, and however their operation grows or changes,” said Wo Chang, a NIST computer scientist and convener of one of the collaboration’s working groups. “This framework is a reference for how to create an ‘agnostic’ environment for tool creation. If software vendors use the framework’s guidelines when developing analytical tools, then analysts’ results can flow uninterruptedly, even as their goals change and technology advances.”
The framework fills a long-standing need among data scientists, who must extract meaning from ever larger and more varied datasets while navigating an evolving technology ecosystem, NIST said. Interoperability is more important than ever as these huge amounts of data pour in from a growing number of platforms, ranging from telescopes, cameras and physics experiments to the countless tiny sensors and devices linked into the internet of things. “Several years ago the world was generating 2.5 exabytes (billion billion bytes) of data each day. That number is predicted to reach 463 exabytes daily by 2025. This is more data than would fit on 212 million DVDs,” NIST explained.
In “big data analytics,” computer scientists examine very large, diverse datasets that mix structured, semi-structured and unstructured data, drawn from different sources and ranging in size from terabytes to zettabytes. Data of this complexity and size requires advanced technology to uncover hidden patterns, correlations and other insights. Supported by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers businesses many benefits, including better-informed decision-making, new revenue opportunities, more effective marketing, better customer service, improved operational efficiency and competitive advantages over rivals.
With the rapid growth of tool availability, data scientists now have the option of scaling up their work from a single, small desktop computing setup to a large, distributed cloud-based environment with many processor nodes. Often, however, this shift places enormous demands on the analyst. For example, tools may have to be reengineered from scratch using a different computer language or algorithm, costing staff time and potentially sacrificing time-critical insights.
The framework is an effort to help address these problems. As with the draft versions NIST released previously, the final version includes consensus definitions and taxonomies to help ensure developers are on the same page when they discuss plans for new tools. It adds key requirements for integrating data security and privacy protections into these tools. The final version also offers a new reference architecture interface specification that will guide these tools’ actual deployment, NIST said.
“The reference architecture interface specification will enable vendors to build flexible environments that any tool can operate in,” Chang said. “Before, there was no specification on how to create interoperable solutions. Now they will know how.”
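The core idea behind such an interface specification — that tools and platforms agree on a common contract so either side can be swapped out — can be illustrated with a minimal sketch. All names here (`AnalyticsTool`, `MeanEstimator`, `run_pipeline`) are hypothetical and are not drawn from the NIST specification itself; the sketch only shows the general plug-in pattern the article describes.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List

class AnalyticsTool(ABC):
    """Hypothetical plug-in contract: any tool implementing it can run
    unchanged on any platform that programs against this interface."""

    @abstractmethod
    def fit(self, records: Iterable[Any]) -> None:
        """Train or configure the tool on input records."""

    @abstractmethod
    def analyze(self, record: Any) -> Any:
        """Produce a result for a single record."""

class MeanEstimator(AnalyticsTool):
    """A trivial conforming tool: reports each record's deviation
    from the mean of the fitted data."""
    def __init__(self) -> None:
        self.total, self.count = 0.0, 0

    def fit(self, records: Iterable[float]) -> None:
        for r in records:
            self.total += r
            self.count += 1

    def analyze(self, record: float) -> float:
        return record - self.total / self.count

def run_pipeline(tool: AnalyticsTool, data: List[float]) -> List[float]:
    """The platform sees only the interface, never the concrete tool,
    so a more advanced tool can be swapped in without retooling."""
    tool.fit(data)
    return [tool.analyze(x) for x in data]
```

Here `run_pipeline(MeanEstimator(), [1.0, 2.0, 3.0])` returns `[-1.0, 0.0, 1.0]`, and replacing `MeanEstimator` with any other conforming tool requires no change to `run_pipeline` — the interoperability the specification aims for.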
This interoperability could help analysts better address a number of data-intensive contemporary problems, such as weather forecasting. For example, meteorologists section the atmosphere into small blocks and apply analytics models to each block, using big data techniques to keep track of changes that hint at the future. As these blocks get smaller and the ability to analyze finer details grows, forecasts can improve, provided more advanced computational components and modeling tools can be brought to bear.
“You model these cubes with multiple equations whose variables move in parallel,” Chang said. “It’s hard to keep track of them all. The agnostic environment of the framework means a meteorologist can swap in improvements to an existing model. It will give forecasters a lot of flexibility.”
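The swap-in flexibility Chang describes can be sketched in miniature. In this toy example (all names and the trivial model are invented for illustration, not taken from any real forecasting code), a driver advances every cell of a grid using whatever model function it is handed, so a better model can replace `simple_diffusion` without touching the driver.

```python
def simple_diffusion(cell, neighbors):
    """Naive stand-in model: relax each cell halfway toward
    the average of its neighbors."""
    avg = sum(neighbors) / len(neighbors)
    return cell + 0.5 * (avg - cell)

def step(grid, model):
    """Advance every cell of a 2-D grid one time step using the
    supplied per-cell model. The driver depends only on the model's
    call signature, so models are interchangeable."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            neighbors = [grid[x][y]
                         for x, y in ((i - 1, j), (i + 1, j),
                                      (i, j - 1), (i, j + 1))
                         if 0 <= x < rows and 0 <= y < cols]
            new[i][j] = model(grid[i][j], neighbors)
    return new
```

Calling `step([[0, 0], [0, 4]], simple_diffusion)` yields `[[0.0, 1.0], [1.0, 2.0]]`; swapping in an improved model means passing a different function, nothing more.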
NIST said another potential application is drug discovery, where scientists must explore the behavior of multiple candidate drug proteins in one round of tests and then feed the results back into the next round. Unlike weather forecasting, where an analytical tool must keep track of multiple variables that change simultaneously, the drug development process generates long strings of data where the changes occur in sequence. While this problem demands a different big data approach, it would still benefit from the ability to make changes easily, as drug development is a tedious, time-consuming and expensive process.
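The sequential feedback loop described above — score candidates, keep the best, seed the next round — can be sketched as follows. The scoring function, round count and survival fraction here are all made-up illustrations, not part of any real screening protocol.

```python
def screen(candidates, score, rounds=3, keep=0.5):
    """Sequential screening sketch: each round scores the pool and
    the top fraction of candidates survives to seed the next round."""
    pool = list(candidates)
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)
        pool = pool[:max(1, int(len(pool) * keep))]
    return pool
```

With a toy score, `screen(range(10), score=lambda c: c, rounds=2)` narrows ten candidates to `[9, 8]` over two rounds. Because each round depends on the previous one's output rather than on simultaneous variables, this loop illustrates the "changes occur in sequence" pattern the article contrasts with weather modeling.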
Whether applied to one of these or other big-data-related problems — from spotting healthcare fraud to identifying animals from a DNA sample — the value of the framework will be in helping analysts speak to one another and more easily apply all the data tools they need to achieve their goals, NIST said.
“Performing analytics with the newest machine learning and AI techniques while still employing older statistical methods will all be possible,” Chang said. “Any of these approaches will work. The reference architecture will let you choose.”