Data Curation Systems By @JnanDash | @BigDataExpo #BigData

There is talk of a third generation of tools termed 'scalable data curation'

The data world has a whole area that goes by various names: data integration, data movement, data curation or cleaning, data transformation, and so on. One of the pioneers is Informatica, which came into being when the data warehouse became a hot topic during the 1990s. The term ETL (extraction, transformation, loading) became part of the warehouse lexicon. If we call these the first generation of data integration tools, then they did an adequate job for their time. The T of ETL was often the hardest part, as it required business domain knowledge. Data were assembled from a small number of sources (usually fewer than 20) into the warehouse for offline analysis and reporting. The cost of the data curation (mostly data cleaning) required to get heterogeneous data into a proper format for querying and analysis was high. During my years at Oracle in the mid-1990s, such tools were provided by third-party companies. Many warehouse projects ended up substantially over budget and late.

Then a second generation of ETL systems arrived, in which the major ETL products were extended with data cleaning modules and additional adaptors to ingest other kinds of data. Data curation involved ingesting data sources, cleaning errors, transforming attributes into other ones, integrating schemas to connect disparate data sources, and performing entity consolidation to remove duplicates. But a professional programmer was needed to handle all of this. With the arrival of the Internet, many new sources of data also arrived, diversity increased manyfold, and the integration task became much tougher.
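To make the curation steps concrete, here is a minimal Python sketch of ingestion, cleaning, attribute transformation, schema integration, and entity consolidation. Everything in it is an assumption made for illustration: the two sources (`crm`, `billing`), their column names, the sample rows, and the rule that two records are duplicates when their normalized names are equal. It does not describe how any particular ETL product works.

```python
# Hypothetical raw rows from two sources with different schemas.
CRM_ROWS = [
    {"cust_name": " Alice Smith ", "phone": "555-0100"},
    {"cust_name": "Bob Jones", "phone": ""},
]
BILLING_ROWS = [
    {"fullName": "alice smith", "tel": "5550100"},
    {"fullName": "Carol Lee", "tel": "555-0199"},
]

# Schema integration: map each source's column names onto one target schema.
SCHEMA_MAP = {
    "crm": {"cust_name": "name", "phone": "phone"},
    "billing": {"fullName": "name", "tel": "phone"},
}


def ingest(source, rows):
    """Rename source columns to the target schema and tag each record's origin."""
    mapping = SCHEMA_MAP[source]
    return [
        {**{mapping[key]: value for key, value in row.items()}, "source": source}
        for row in rows
    ]


def clean(record):
    """Collapse stray whitespace and turn empty phone strings into None."""
    name = " ".join(record["name"].split())
    phone = record["phone"].strip() or None
    return {**record, "name": name, "phone": phone}


def transform(record):
    """Normalize attributes: title-case names, digits-only phone numbers."""
    phone = record["phone"]
    digits = "".join(ch for ch in phone if ch.isdigit()) if phone else None
    return {**record, "name": record["name"].title(), "phone": digits}


def consolidate(records):
    """Merge records that share a normalized name (a deliberately naive rule)."""
    merged = {}
    for rec in records:
        key = rec["name"].lower()
        if key not in merged:
            merged[key] = {
                "name": rec["name"],
                "phone": rec["phone"],
                "sources": [rec["source"]],
            }
        else:
            existing = merged[key]
            # Keep the first non-empty phone number seen for this entity.
            existing["phone"] = existing["phone"] or rec["phone"]
            existing["sources"].append(rec["source"])
    return list(merged.values())


def curate():
    """Run the whole pipeline over both hypothetical sources."""
    records = ingest("crm", CRM_ROWS) + ingest("billing", BILLING_ROWS)
    records = [transform(clean(rec)) for rec in records]
    return consolidate(records)
```

Calling `curate()` on the sample rows folds the two spellings of Alice Smith into one record and leaves three customers in total. Even in this toy version, every mapping and rule is written by hand, which is why this style of curation needed a programmer and became harder as the number of sources grew.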

Now there is talk of a third generation of tools, termed “scalable data curation”, which can scale to hundreds or even thousands of data sources. Experts say that such tools can use statistics and machine learning to make automatic decisions wherever possible, and call for human interaction only when it is needed.
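The idea of deciding automatically where possible and asking a human only when needed can be sketched as a confidence-thresholded triage. This is only an illustration under stated assumptions: the string similarity from Python's standard `difflib` module stands in for a statistical or machine-learned matcher, and the two thresholds are hypothetical values, not figures from any of the tools discussed here.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; a real system would tune these against labeled data.
AUTO_MATCH = 0.90
AUTO_REJECT = 0.50


def similarity(a, b):
    """Score two strings in [0, 1]; a stand-in for a learned match model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def decide(a, b):
    """Decide automatically when the score is clear-cut, else defer to a person."""
    score = similarity(a, b)
    if score >= AUTO_MATCH:
        return "match"
    if score <= AUTO_REJECT:
        return "no_match"
    return "needs_review"


def triage(pairs):
    """Split candidate pairs into automatic decisions and a human review queue."""
    queues = {"match": [], "no_match": [], "needs_review": []}
    for a, b in pairs:
        queues[decide(a, b)].append((a, b))
    return queues
```

Only the pairs that land in `needs_review` reach a person. As the scorer improves, that queue shrinks, which is what would let this approach scale to many more sources than hand-written rules.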

Start-ups such as Trifacta and Paxata emerged, applying such techniques to data preparation, an approach subsequently embraced by incumbents Informatica, IBM, and Solix. A new startup called Tamr (cofounded by Mike Stonebraker of Ingres, Vertica, and VoltDB fame), which was funded last year by Google Ventures and NEA ($16M in funding), claims to deliver true “curation at scale”. It has adopted a similar approach but applied it to a different upstream problem: curating data from multiple sources. IBM has publicly stated its direction to develop a “Big Match” capability for Big Data that would complement its MDM (master data management) tools. More companies are expected to join this effort.

In summary, ETL systems arose to deal with the transformation challenges in early data warehouses. They evolved into second-generation data curation systems with an expanded scope of offerings. Now a new generation of data curation systems is emerging to address the Big Data world, where data sources have multiplied and become more heterogeneous. On the surface, this seems quite opposite to the concept of the “data lake”, where data is stored in its native formats. However, the so-called “data refinery” is no different from the curation process.

More Stories By Jnan Dash

Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.
