Today, machine learning is used across all major industries, including manufacturing, retail, healthcare, travel, financial services, and energy. Some of the top machine learning use cases include predictive maintenance and condition monitoring in manufacturing, dynamic pricing in travel, and upselling and cross-channel marketing in retail. In fact, according to Forbes, “57% of enterprise executives believe that the most critical growth benefit of AI and ML will be improving customer experiences and support.” However, enabling machine learning initiatives requires sophisticated infrastructure that is adaptable and can quickly integrate and process large amounts of data from disparate sources. That data is often scattered across multiple data platforms, tools, applications and processing engines, and establishing and maintaining such an infrastructure can be complicated and costly.
It’s Time for a New Approach
Many organisations are looking for new ways to store their data, often via a data lake or data lakehouse. The purpose of a data lake is to collect large volumes of data from multiple, disparate sources, including data of different types (both structured and unstructured), and to store this data in its original format. Replicating data from the system of origin can be slow and costly, and sometimes only a small subset of the relevant data ends up in the data lake. To leverage this data for machine learning, it first needs to be integrated. With the increasingly distributed nature of the data ecosystem, data integration becomes more complex and harder to achieve in a reasonable time frame using traditional methods. Data is typically spread across a hybrid of cloud providers and on-premises systems, making access and integration even more challenging. According to the Total Economic Impact (TEI) of Data Virtualization survey conducted by Forrester, Data Scientists spend about 30% of their time on data wrangling and data curation.
The Benefits of Leveraging Data Virtualization
An alternative to moving data from multiple source systems into a new, centralized repository is data virtualization. It provides real-time, logical, consolidated views of data without replication, allowing the data to remain at its origin. The data can reside on-premises or in the cloud, and it can be of differing types and structures. For Data Scientists, this means more access to data, in a truly self-service and flexible way. The Data Scientist no longer needs to be concerned with the technical complexities of the underlying data sources or how the data is joined and combined. The data virtualization layer hides this complexity while still providing the flexibility to model the data in different ways for different business requirements, including data science and advanced analytics.
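To make this concrete, here is a minimal sketch of how a Data Scientist might consume such a consolidated view. It assumes a hypothetical ODBC data source named `virtual_layer` exposed by a data virtualization server and a hypothetical logical view `customer_360` assembled from several underlying systems; the names and connection details are illustrative, not taken from any particular product.

```python
# A minimal sketch: querying one logical view through a data virtualization
# layer. The ODBC DSN "virtual_layer" and the view "customer_360" are
# hypothetical placeholders, not names from any particular product.
import pandas as pd
import pyodbc

# One connection to the virtualization server, however many physical systems
# (cloud warehouse, on-premises database, SaaS application) sit behind it.
conn = pyodbc.connect("DSN=virtual_layer")

# Ordinary SQL against the logical view; the virtualization layer federates
# the query to the source systems at run time, so nothing is replicated
# into a new repository.
df = pd.read_sql(
    """
    SELECT customer_id, region, lifetime_value, last_purchase_date
    FROM customer_360
    WHERE region = 'EMEA'
    """,
    conn,
)

print(df.head())
conn.close()
```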
By providing a single access point for all corporate data assets, regardless of location and format, data virtualization delivers real data agility. Data Scientists and Data Engineers can apply functions on top of the physical data to obtain different logical views of the same physical data, without creating additional physical copies of the source data. This offers a fast and inexpensive way to address many of the specific data challenges Data Scientists face when integrating data for machine learning. Best-of-breed data virtualization tools also offer a searchable data catalog that includes extended metadata for each data set, such as tags, column descriptions and commentary, as well as active metadata like who uses which data set, when, and how. This data usage knowledge is key to better serving the data needs of the business.
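The "different logical views of the same physical data" idea can be pictured as derived views defined entirely in the virtualization layer. The sketch below assumes the layer accepts ANSI-style `CREATE VIEW` statements over previously published views, which is a simplification since each product has its own DDL; the view and column names continue the hypothetical example above.

```python
# A sketch of deriving two different logical views from the same physical
# data. It assumes the virtualization layer accepts ANSI-style CREATE VIEW
# statements (actual DDL varies by product); only view definitions
# (metadata) are stored, and no source data is copied.
import pyodbc

conn = pyodbc.connect("DSN=virtual_layer")  # hypothetical DSN, as above
cursor = conn.cursor()

# View 1: a per-customer feature table shaped for a churn model.
cursor.execute(
    """
    CREATE VIEW churn_features AS
    SELECT customer_id,
           region,
           lifetime_value,
           last_purchase_date
    FROM customer_360
    """
)

# View 2: a regional roll-up shaped for a pricing dashboard, built from
# exactly the same underlying physical data.
cursor.execute(
    """
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(lifetime_value) AS total_value
    FROM customer_360
    GROUP BY region
    """
)

conn.commit()
conn.close()
```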
Keeping it Simple
Data virtualization brings clarity and simplicity to the data integration process. Data is everywhere, so regardless of whether it is stored in a relational database, a Hadoop cluster, a SaaS application, a multi-dimensional cube or a NoSQL system, data virtualization serves it up in a consistent way. Exposing the data according to a consistent model or representation avoids having to create separate pools of data that can yield different results. Data virtualization also promotes reusability. It makes it possible to clearly and cost-effectively separate the responsibilities of IT data architects and engineers from those of the Data Scientists, as sketched below. Leveraging data virtualization, reusable logical data sets can be developed to expose information in different ways, and the data can be standardized as it is brought together.
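One way to picture that separation of responsibilities: the data engineering side publishes a small, reusable access helper over a logical view, and data science projects call it instead of writing source-specific plumbing. This is a sketch under the same hypothetical names (`virtual_layer`, `churn_features`) used above.

```python
# A sketch of the reusability idea: data engineers publish one helper over a
# logical view, and data science projects reuse it instead of re-implementing
# source-specific access code. DSN and view names are hypothetical, as above.
import pandas as pd
import pyodbc

def load_customer_features(region: str) -> pd.DataFrame:
    """Return a standardized customer feature set for one region.

    Whether the underlying data lives in a relational database, a Hadoop
    cluster or a SaaS application is hidden behind the logical view.
    """
    conn = pyodbc.connect("DSN=virtual_layer")
    try:
        return pd.read_sql(
            "SELECT customer_id, lifetime_value, last_purchase_date "
            "FROM churn_features WHERE region = ?",
            conn,
            params=[region],
        )
    finally:
        conn.close()

# Every team consumes the same, consistently shaped data set.
emea_customers = load_customer_features("EMEA")
print(emea_customers.head())
```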

The Forrester TEI survey found that data preparation tasks can be reduced by 67%, allowing data science work to be accelerated. As the adoption of machine learning and artificial intelligence continues to grow, data lakes will become more prevalent, and data virtualization will become increasingly necessary for optimizing the productivity of Data Scientists and the data-intensive initiatives they work on.
The biggest benefit, however, is on the data integration side, where a considerable amount of time is currently spent. More time can be focused on the scientific methods for extracting actionable insights from data, rather than on data engineering and management tasks. By simplifying the way data is accessed, data virtualization simplifies machine learning initiatives, and the entire organisation enjoys the full benefits of cost-effectively gleaning real-time business insights.