Preparing Big Data for Analysis in R

by Yaniv Mor, Co-founder & CEO of Xplenty

How do you get Big Data ready for R? Gigabytes or terabytes of raw data may need to be combined, cleaned, and aggregated before they can be analyzed. Processing such large amounts of data used to require installing Hadoop on a cluster of servers, not to mention coding MapReduce jobs in Pig or Java. Those days are over.

This post shows how raw data can be prepared for analysis in R without any code or server installations. Instead, we’ll use Xplenty’s data integration-as-a-service to design a data flow, create a cluster, and run the job, all via a friendly user interface.

For this demo we’ll use 1.5 GB of raw web logs (uncompressed) from the servers that hosted the “Star Wars Kid” video. A remix of the video was also hosted there, along with the usual fare of HTML pages, images, and more. Here’s an example log line:

208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 "http://www.kottke.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"

Log line format:

Source IP/domain
User identifier (blank)
User ID (blank)
Date - in the format dd/MMM/yyyy:HH:mm:ss Z
HTTP request - type, URL, HTTP version
HTTP code
Bytes transferred
Referrer
User agent

Let’s say we would only like to analyze requests to the original “Star Wars Kid” video by source IP, date, and referrer. Imagine what it would be like to set up the servers and write the code - the hours spent writing and debugging a relatively simple dataflow. Feel the stress building? Let it go. Here’s what such a dataflow looks like in Xplenty. Let’s take a closer look at how it works:

Source - loads the data from Amazon S3 and splits it into fields. The data is publicly available on S3 at xplenty.public/weblogs/star_wars_kid.log.gz. If you’d like to take a look at the data, download it via the web, or create an AWS account and use a tool such as S3Browser to access the above path.

Select - keeps only the ip, date, url, and referrer fields and leaves the rest of the data out. Note that the date field also contains the time, and that the request field also contains the request type and HTTP version. Both are cleaned in the select component using a regular expression.

Filter - matches Star_Wars_Kid.wmv in the URL field and removes all other log lines.

Destination - stores the results back into Amazon S3.

No setup or installation is needed. Just a few clicks enable you to create a new cluster. Then, one more screen gets the job running.

The results: about 120 MB (uncompressed) of log lines for video file requests, with IPs, dates, URLs, and referrers, now ready for analysis. Job running time: about 3 minutes. The full results are available in the xplenty.dumpster bucket at starwarskid/videos.gz. Here are a few sample lines:

66.142.89.235 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/
63.195.36.218 09/May/2003 /random/video/Star_Wars_Kid.wmv -
66.27.235.199 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.kuro5hin.org/story/2003/5/2/16116/46048
24.81.67.79 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/archive/2003/04/29/star_war.shtml
12.149.141.14 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/

Now, we can finally analyze the data in R. Here’s sample code which generates a traffic graph by date for Star_Wars_Kid.wmv:

df <- read.table('star-wars-kid.tsv', fill = TRUE)
colnames(df) >
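The Select and Filter steps and the traffic-by-date idea can be tried locally on a handful of lines. The following is a minimal R sketch of that flow. It is not Xplenty's implementation and not the remainder of the post's code: the regular expression, the function names (parse_log_lines, parse_log_date), and the column names are assumptions made for illustration. The first sample line is the post's example; the two video requests reuse IPs, date, URL, and referrers from the sample output, with hypothetical times, status codes, byte counts, and user agents.

```r
# Sketch only: mimics the Select and Filter steps on a few raw log lines,
# then counts video requests per day. All names here are illustrative.

# First line: the example from the post. Other two: hypothetical raw lines
# rebuilt from the post's sample output (time, status, bytes, agent made up).
raw <- c(
  '208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 "http://www.kottke.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"',
  '66.142.89.235 - - [09/May/2003:08:00:00 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.1" 200 1000 "http://www.waxy.org/" "ExampleAgent/1.0"',
  '63.195.36.218 - - [09/May/2003:08:05:00 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.1" 200 1000 "-" "ExampleAgent/1.0"'
)

# Capture groups: (1) IP, (2) date without the time, (3) URL, (4) referrer.
log_pattern <- '^(\\S+) \\S+ \\S+ \\[([^:]+):[^]]+\\] "\\S+ (\\S+) [^"]*" \\d+ \\S+ "([^"]*)" .*$'

# "Select": keep only ip, date, url, and referrer; lines that do not match
# the pattern are dropped.
parse_log_lines <- function(lines) {
  matches <- regmatches(lines, regexec(log_pattern, lines, perl = TRUE))
  matched <- matches[lengths(matches) == 5]
  fields <- t(vapply(matched, function(m) m[2:5], character(4),
                     USE.NAMES = FALSE))
  df <- as.data.frame(fields, stringsAsFactors = FALSE)
  colnames(df) <- c("ip", "date", "url", "referrer")
  df
}

# Convert dd/MMM/yyyy to a Date. month.abb holds English abbreviations, so
# this does not depend on the session locale.
parse_log_date <- function(x) {
  month <- match(substr(x, 4, 6), month.abb)
  iso <- sprintf("%s-%02d-%s", substr(x, 8, 11), month, substr(x, 1, 2))
  as.Date(iso, format = "%Y-%m-%d")
}

parsed <- parse_log_lines(raw)

# "Filter": keep only requests for the original video.
video <- parsed[grepl("Star_Wars_Kid.wmv", parsed$url, fixed = TRUE), ]
video$date <- parse_log_date(video$date)

# Requests per day, as a data frame with columns date and Freq.
daily <- as.data.frame(table(date = video$date), stringsAsFactors = FALSE)
daily$date <- as.Date(daily$date)

# Draw the traffic graph only in an interactive session.
if (interactive()) {
  plot(daily$date, daily$Freq, type = "b",
       xlab = "Date", ylab = "Requests",
       main = "Star_Wars_Kid.wmv requests per day")
}
```

With the full job output, a read.table call like the one in the post would take the place of parse_log_lines and the filter. The date parsing and counting steps would stay the same, assuming the four columns arrive in the order ip, date, url, referrer.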

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.
