Big Data, Speed and Efficiency

Here's how two part-time DBAs meet mobile app ad platform Tapjoy’s massive data needs

The next BriefingsDirect Voice of the Customer big data case study discussion examines how mobile app advertising platform Tapjoy handles fast and massive data -- some two dozen terabytes per day -- with just two part-time database administrators (DBAs).

Examine how Tapjoy’s data-driven business of serving 500 million global mobile users -- or more than 1.5 million ad engagements per day, a data volume of 120 terabytes -- runs with extreme efficiency.

To learn more about how high scale and complexity meet minimal labor for building user and advertiser loyalty, we're joined by David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Mobile advertising has really been a major growth area, perhaps more than any other type of advertising. We hear a lot about advertising waning, but not mobile app advertising. How do Tapjoy and its platform contribute to the success of what we're seeing in the mobile app ad space?

Abercrombie: The key to Tapjoy’s success is engaging the users and rewarding them for engaging with an ad. Our advertising model is you engage with an ad and then you get typically some sort of reward: A virtual currency in the game you're playing or some sort of discount.

We actually have the kind of ads that lead users to seek us out to engage with the ads and get their rewards.

Gardner: So this is quite a bit different than a static presented ad. This is something that has a two-way street, maybe multiple directions of information coming and going. Why the analysis? Why is that so important? And why the speed of analysis?

Abercrombie: We have basically three types of customers. We have the app publishers who want to monetize and get money from displaying ads. We have the advertisers who need to get their message out and pay for that. Then, of course, we have the users who want to engage with the ads and get their rewards.

The key to Tapjoy’s success is being able to balance the needs of all of these disparate users. We can’t charge the advertisers too much for their ads, even though the monetizers would like that. It’s a delicate balancing act, and that can only be done through big-data analysis, careful optimization, and careful monitoring of the ad network assets and operation.

Gardner: Before we learn more about the analytics, tell us a bit more about what role Tapjoy plays specifically in what looks like an ecosystem play for placing, evaluating, and monetizing app ads? What is it specifically that you do in this bigger app ad function?

Ad engagement model

Abercrombie: Specifically what Tapjoy does is enable this rewarded ad engagement model, so that the advertisers know that people are going to be paying attention to their ads and so that the publishers know that the ads we're displaying are compatible with their app and are not going to produce a jarring experience. We want everybody to be happy -- the publishers, the advertisers, and the users. That’s a delicate compromise that’s Tapjoy’s strength.

Gardner: And when you get an end user to do something, to take an action, that’s very powerful, not only because you're getting them to do what you wanted, but you can evaluate what they did under what circumstances and so forth. Tell us about the model of the end user specifically. What is it about engaging with them that leads to the data -- which we will get to in a moment?

Abercrombie: In our model of the user, we talk about long-term value. So even though it may be a new user who has just started with us, maybe their first engagement, we like to look at them in terms of their long-term value, both to the publishers and the advertiser.

We don’t want people who are just engaging with the ad and going away, getting what they want and not really caring about it. Rather, we want good users who will continue their engagement and continue this process. Once again, that takes some fairly sophisticated machine-learning algorithms and very powerful inferences to be able to assess the long-term value.

As an example, we have our publishers who are also advertisers. They're advertising their app within our platform and for them the conversion event, what they are looking for, is a download. What we're trying to do is to offer them users who will not only download the game once to get that initial payoff reward, but will value the download and continue to use it again and again.

So all of our models are designed with that end in mind -- to look at the long-term value of the user, not just the immediate conversion at this instant in time.

Gardner: So perhaps it’s a bit of a misnomer to talk about ads in apps. We're really talking about a value-add function in the app itself.

Abercrombie: Right. The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising. If it’s another app, they want good users for whom that app is relevant and useful.

That’s really the way we look at it. That’s the way to enhance the overall experience in the long-term. We're not just in it for the short-term. We're looking at developing a good solid user base, a good set of users who engage thoroughly.

Gardner: And as I said in my set-up, there's nothing hotter in all of advertising than mobile apps and how to do this right. It’s early innings, but clearly the stakes are very high.

A tough business

Abercrombie: And it’s a tough business. People are saturated. Many people don’t want ads. Some of the business models are difficult to master.

For instance, there may be a sequence of multiple ad units. There may be a video followed by another ad to download something. It becomes a very tricky thing to balance the financing here. If it was just a simple pass-through and we take a cut, that would be trivial, but that doesn't work in today's market. There are more sophisticated approaches, which do involve business risk.

If we reward the user, based on the fact that they're watching the video, but then they don't download the app, then we don't get money. So we have to look very carefully at the complexity of the whole interaction to make it as smooth and rewarding as possible, so that the thing works. That's difficult to do.
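The risk described above can be made concrete with a little arithmetic. The following is only an illustrative sketch, not Tapjoy's pricing model: the function names and all of the dollar amounts and probabilities are hypothetical, chosen to show how a reward paid up front for a video view loses money unless enough viewers go on to download.

```python
def expected_margin(reward_cost, payout_on_download, download_probability):
    """Expected profit per rewarded video view.

    The reward is paid on every view; the advertiser pays only when the
    viewer goes on to download. All inputs are hypothetical.
    """
    return download_probability * payout_on_download - reward_cost


def break_even_probability(reward_cost, payout_on_download):
    """Download rate at which the expected margin is exactly zero."""
    return reward_cost / payout_on_download


# Made-up numbers: a $0.10 reward and a $1.00 payout per download.
margin_low = expected_margin(0.10, 1.00, download_probability=0.05)
margin_high = expected_margin(0.10, 1.00, download_probability=0.20)
threshold = break_even_probability(0.10, 1.00)

print(f"margin at 5% downloads:  {margin_low:+.2f}")
print(f"margin at 20% downloads: {margin_high:+.2f}")
print(f"break-even download rate: {threshold:.0%}")
```

Under these assumed numbers the network loses money on each view until at least one viewer in ten downloads, which is why the quality of the users matters more than the raw number of views.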

Gardner: So we're in a dynamic, fast-growing, fairly fresh, new industry. Knowing what's going to happen before it happens is always fun in almost any industry, but in this case, it seems with those high stakes and to make that monetization happen, it’s particularly important.

Tell me now about gathering such large amounts of data, being able to work with it, and then allowing analysis to happen very swiftly. How do you go about making that possible?

Abercrombie: Our data architecture is relatively standard for this type of clickstream operation. There is some data that can be put directly into a transactional database in real time, but typically, that's only when you get to the very bottom of the funnel, the conversion stuff. But all that clickstream stuff gets written as JSON-formatted log files, gets swept up by a queuing system, and then put into our data systems.

Our legacy system involved a homegrown queuing system, dumping data into HDFS. From there, we would extract and load CSVs into Vertica. As with so many other organizations, we're moving to more real-time operations. Our queuing system has evolved from a couple of different homegrown applications, and now we're implementing Apache Kafka.
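As a rough illustration of the legacy step described above (JSON-formatted log lines turned into CSV for bulk loading), here is a minimal sketch. It is not Tapjoy's code: the field names in `FIELDS` and the sample events are hypothetical, since the interview does not describe the actual log schema.

```python
import csv
import io
import json

# Hypothetical event fields; the real log schema is not given in the interview.
FIELDS = ["event_time_ms", "event_type", "user_id", "app_id", "ad_id"]


def json_lines_to_csv(lines):
    """Convert JSON log lines to CSV text, counting lines that cannot be used."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(event, dict):
            skipped += 1
            continue
        # Missing fields become empty CSV cells rather than failing the batch.
        writer.writerow([event.get(field, "") for field in FIELDS])
    return out.getvalue(), skipped


sample = [
    '{"event_time_ms": 1465000000123, "event_type": "impression", '
    '"user_id": "u1", "app_id": "a9", "ad_id": "ad7"}',
    "not json",
    '{"event_time_ms": 1465000000456, "event_type": "click", '
    '"user_id": "u1", "app_id": "a9", "ad_id": "ad7"}',
]
csv_text, skipped = json_lines_to_csv(sample)
print(csv_text, end="")
print("skipped:", skipped)
```

Counting rejected lines instead of silently dropping them gives the load job a number to check later against the source.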

We use Spark as part of our infrastructure, as sort of a hub, if you will, where data is farmed out to other systems, including a real-time, in-memory SQL database, which is fairly new to us this year. Then, we're still putting data in HDFS, and that's where the machine learning occurs. From there, we're bringing it into Vertica.

In Vertica -- and our Vertica cluster has two main purposes -- there is the operational data store, which has the raw, flat tables that are one row for every event, with the millisecond timestamps and the IDs of all the different entities involved.

From that operational data store, we do a pure SQL ETL extract into kind of an old-school star schema within Vertica, the same database.
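To show what a pure-SQL ETL inside a single database can look like, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for Vertica. The table and column names (`ods_events`, `dim_app`, `fact_daily_events`) are hypothetical and the schema is far simpler than a production warehouse; the point is only that the flat event table and the star schema live side by side, and plain `INSERT ... SELECT` statements move data between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational data store: one row per event (hypothetical columns).
CREATE TABLE ods_events (
    event_time_ms INTEGER, event_type TEXT,
    user_id TEXT, app_id TEXT, ad_id TEXT
);
INSERT INTO ods_events VALUES
    (1465000000123, 'impression', 'u1', 'a9', 'ad7'),
    (1465000000456, 'click',      'u1', 'a9', 'ad7'),
    (1465000050000, 'impression', 'u2', 'a9', 'ad7'),
    (1465086400000, 'impression', 'u1', 'a3', 'ad8');

-- Star schema: one dimension table and one daily fact table.
CREATE TABLE dim_app (app_key INTEGER PRIMARY KEY, app_id TEXT UNIQUE);
CREATE TABLE fact_daily_events (
    event_date TEXT, app_key INTEGER, event_type TEXT, event_count INTEGER
);

-- ETL step 1: add app IDs that the dimension has not seen yet.
INSERT INTO dim_app (app_id)
SELECT DISTINCT app_id FROM ods_events
WHERE app_id NOT IN (SELECT app_id FROM dim_app);

-- ETL step 2: aggregate raw events into the fact table.
INSERT INTO fact_daily_events
SELECT date(e.event_time_ms / 1000, 'unixepoch'),
       d.app_key, e.event_type, COUNT(*)
FROM ods_events e
JOIN dim_app d ON d.app_id = e.app_id
GROUP BY date(e.event_time_ms / 1000, 'unixepoch'), d.app_key, e.event_type;
""")

rows = conn.execute("""
    SELECT f.event_date, d.app_id, f.event_type, f.event_count
    FROM fact_daily_events f
    JOIN dim_app d ON d.app_key = f.app_key
    ORDER BY f.event_date, d.app_id, f.event_type
""").fetchall()
for row in rows:
    print(row)
```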

Pure SQL

So our business intelligence (BI) ETL is pure SQL and goes into a full-fledged snowflake schema, moderately denormalized with all the old-school bells and whistles, the type 1, type 2, slowly changing dimensions. With Vertica, we're able to denormalize that data warehouse to a large degree.
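For readers unfamiliar with the type 1 and type 2 slowly changing dimensions mentioned above, the following sketch shows the difference. It again uses `sqlite3` as a stand-in, and the `dim_advertiser` table, its columns, and the `apply_change` helper are hypothetical. A type 1 attribute (`name`) is overwritten in place; a type 2 attribute (`tier`) gets a new row so that history is preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_advertiser (
        advertiser_key INTEGER PRIMARY KEY,  -- surrogate key
        advertiser_id  TEXT,                 -- natural key from the source
        name           TEXT,                 -- type 1: overwritten in place
        tier           TEXT,                 -- type 2: history is preserved
        valid_from     TEXT,
        valid_to       TEXT                  -- NULL marks the current row
    )
""")


def apply_change(conn, advertiser_id, name, tier, as_of):
    """Apply one source record to the dimension (hypothetical helper)."""
    current = conn.execute(
        "SELECT advertiser_key, tier FROM dim_advertiser "
        "WHERE advertiser_id = ? AND valid_to IS NULL",
        (advertiser_id,),
    ).fetchone()
    if current is not None and current[1] != tier:
        # Type 2 change: close the current row, then open a new one below.
        conn.execute(
            "UPDATE dim_advertiser SET valid_to = ? WHERE advertiser_key = ?",
            (as_of, current[0]),
        )
        current = None
    if current is None:
        conn.execute(
            "INSERT INTO dim_advertiser "
            "(advertiser_id, name, tier, valid_from, valid_to) "
            "VALUES (?, ?, ?, ?, NULL)",
            (advertiser_id, name, tier, as_of),
        )
    # Type 1 change: the latest name replaces the old one on every row.
    conn.execute(
        "UPDATE dim_advertiser SET name = ? WHERE advertiser_id = ?",
        (name, advertiser_id),
    )


apply_change(conn, "adv1", "Acme", "silver", "2016-01-01")
apply_change(conn, "adv1", "Acme Games", "silver", "2016-02-01")  # type 1 only
apply_change(conn, "adv1", "Acme Games", "gold", "2016-03-01")    # type 2

history = conn.execute(
    "SELECT name, tier, valid_from, valid_to FROM dim_advertiser "
    "ORDER BY advertiser_key"
).fetchall()
for row in history:
    print(row)
```

The rename leaves a single row behind, while the tier change leaves two, so a report for February can still join to the tier that was in effect at the time.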

Sitting on top of that we have a BI tool. We use MicroStrategy, for which we have defined our various metrics and our various attributes, and it’s very adept at knowing exactly which fact table and which dimensions to join.

So we have sort of a hybrid architecture. I'd say that we have all the way from real-time, in-memory SQL, Hadoop and all of its machine learning and our algorithmic pipelines, and then we have kind of the old-school data warehouse with the operational data store and the star schema.

Gardner: So a complex, innovative, custom architectural approach to this and yet I'm astonished that you are running and using Vertica in multiple ways with two part-time DBAs. How is it possible that you have minimal labor, given this topology that you just described?

Abercrombie: Well, we found Vertica very easy to manage. It has been very well-behaved, very stable.

For instance, we don’t even really use the Management Console, because there is not enough to manage. Our cluster is about 120 terabytes. It’s only on eight nodes and it’s pretty much trouble free.

One of the part-time DBAs deals with kind of more operating-system level stuff -- patches, cluster recovery, those sorts of issues. And the other part-time DBA is me. I deal more with data structure design, SQL tuning and Vertica training for our staff.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

When we first started out, we tried running Vertica in Amazon EC2. Mind you, this was four or five years ago. Amazon EC2 was not where it is today. It failed. It was very difficult to manage. There were perplexing problems that we couldn’t solve. So we moved our Vertica and essentially all of our big-data systems out of the cloud onto dedicated hardware, where they are much easier to manage and much easier to provision with the proper resources.

Then, at one time in our history, when we built a dedicated hardware cluster for Vertica, we failed to heed properly the hardware planning guide and did not provision enough disk I/O bandwidth. In those situations, Vertica is unstable, and we had a lot of problems.

But once we got the proper disk I/O, it has been smooth sailing. I can’t even remember the last time we even had a node drop out. It has been rock solid. I was able to go on a vacation for three weeks recently and know that there would be no problem, and there was no problem.

Gardner: The ultimate key performance indicator (KPI), "I was able to go on vacation."

Fairly resilient

Abercrombie: Exactly. And with the proper hardware design, HPE Vertica is fairly resilient against out-of-control queries. There was a time when half my time was spent monitoring for slow queries, but again, with the proper hardware, it's smooth sailing. I don’t even bother with that stuff anymore.

Our MicroStrategy BI tool writes very good SQL. Part of the key to our success with this BI portion is designing the Vertica schema and the MicroStrategy metadata layer to take advantage of each other’s strengths and avoid each other’s weaknesses. So that really was key to the stable, exceptional performance we get. I basically get no complaints of slow queries from my BI tool. No problem.

Gardner: The right kind of problem to have.

Abercrombie: Yes.

Gardner: Okay, now that we have heard quite a bit about how you are doing this, I'd like to learn, if I could, about some of the paybacks when you do this properly, when it is running well, in terms of SQL queries, ETL load times reduction, the ability for you to monetize and help your customers create better advertising programs that are acceptable and popular. What are the paybacks technically and then in business terms?

Abercrombie: In order to get those paybacks, a key element was confidence in the data, the results that we were shipping out. The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.
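One common form of ETL quality control is reconciling counts between the source and the warehouse. The sketch below is a generic illustration, not a description of Tapjoy's QC process: the `reconcile_counts` function, its `tolerance` parameter, and the sample counts are all hypothetical.

```python
def reconcile_counts(source_counts, warehouse_counts, tolerance=0.0):
    """Return (key, source, warehouse) for every key whose counts disagree.

    `tolerance` is the allowed relative difference, measured against the
    source count. Keys missing on either side are treated as a count of 0.
    """
    problems = []
    for key in sorted(set(source_counts) | set(warehouse_counts)):
        src = source_counts.get(key, 0)
        dst = warehouse_counts.get(key, 0)
        allowed = tolerance * max(src, 1)
        if abs(src - dst) > allowed:
            problems.append((key, src, dst))
    return problems


# Hypothetical daily counts keyed by (date, event type).
source = {("2016-06-04", "impression"): 1000, ("2016-06-04", "click"): 40}
warehouse = {("2016-06-04", "impression"): 1000, ("2016-06-04", "click"): 38}

strict = reconcile_counts(source, warehouse)
lenient = reconcile_counts(source, warehouse, tolerance=0.1)
print("strict:", strict)
print("lenient:", lenient)
```

Running a check like this after each load, and holding back publication when it reports problems, is one way to keep unverified numbers out of a BI tool.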

What that also means is that as a product is under development and when it’s not ready yet, the instrumentation isn’t ready, that stuff doesn’t make it into our BI tool. You can only get that stuff from ad hoc.

So the benefit has been a very clear understanding of the day-to-day operations of our ad network, both for our internal monitoring to know when things are behaving properly, when the instrumentation is working as expected, and when the queues are running, but also for our customers.

Because of the flexibility that we get from a traditional BI system with 500 metrics, over a couple of dozen dimensions, our customers, the publishers and the advertisers, get incredible detail, customized exactly the way they need for ingestion into their systems or to help them understand how Tapjoy is serving them. Again, that comes from confidence in the data.

Gardner: When you have more data and better analytics, you can create better products. Where might we look next to where you take this? I don’t expect you to pre-announce anything, but where can you now take these capabilities as a business and maybe even expand into other activities on a mobile endpoint?

Flexibility in algorithms

Abercrombie: As we expand our business and move into new areas, what we really need is flexibility in our algorithms and the way we deal with some of our real-time decision making.

So one area that’s new to us this year is an in-memory SQL database, like MemSQL. Some of our old real-time ad optimization was based on pre-calculating data and serving it up through HBase KeyValue, but now we can do real-time aggregation queries using SQL, which is easy to understand, easy to modify, very expressive and very transparent. It gives us more flexibility in terms of fine-tuning our real-time decision-making algorithms, which is absolutely necessary.
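The contrast between pre-calculated key-value lookups and real-time SQL aggregation can be sketched in a few lines. This is only an analogy under stated assumptions: a Python dict stands in for the key-value store, an in-memory `sqlite3` database stands in for the in-memory SQL engine, and the event columns and values are hypothetical.

```python
import sqlite3

# Hypothetical events: (ad_id, country, event_type).
events = [
    ("ad7", "US", "impression"), ("ad7", "US", "impression"),
    ("ad7", "US", "conversion"), ("ad7", "KR", "impression"),
    ("ad8", "US", "impression"), ("ad8", "US", "conversion"),
]

# Approach 1: pre-calculate per-ad totals, then serve them by key.
# Any new slice (for example, by country) means rebuilding this table.
precomputed = {}
for ad_id, _country, event_type in events:
    stats = precomputed.setdefault(ad_id, {"impression": 0, "conversion": 0})
    stats[event_type] += 1
rate_ad7 = precomputed["ad7"]["conversion"] / precomputed["ad7"]["impression"]

# Approach 2: keep the raw events and aggregate at query time.
# A new slice is just a different WHERE or GROUP BY clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ad_id TEXT, country TEXT, event_type TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
query = """
    SELECT ad_id,
           1.0 * SUM(event_type = 'conversion')
               / SUM(event_type = 'impression') AS rate
    FROM events
    WHERE country = ?
    GROUP BY ad_id
    ORDER BY ad_id
"""
us_rates = conn.execute(query, ("US",)).fetchall()

print("precomputed ad7 rate:", round(rate_ad7, 3))
print("US rates from SQL:", us_rates)
```

The key-value version answers exactly the question it was built for, while the SQL version can be re-sliced by editing the query, which matches the expressiveness and transparency described in the interview.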

As an example, we acquired a company in Korea called 5Rocks that does app tech and that tracks the users within the app, like what level they're on, or what activities they're doing and what they enjoy, with an eye towards in-app purchase optimization.

And so we're blending the in-app purchase optimization along with traditional ad network optimization, and the two have different rules and different constraints. So we really need the flexibility and expressiveness of our real-time decision making systems.

Gardner: One last question. You mentioned machine learning earlier. Do you see that becoming more prominent in what you do and how you're working with data scientists, and how might that expand in terms of where you employ it?

Abercrombie: Tapjoy started with machine learning. Our data scientists are machine learning. Our predictive algorithm team is about six times larger than our traditional Vertica BI team. Mostly what we do at Tapjoy is predictive analytics and various machine-learning things. So we wouldn't be alive without it. And we expanded. We're not shifting in one direction or another. It's apples and oranges, and there's a place for both.

Sponsor: Hewlett Packard Enterprise.


More Stories By Dana Gardner

At Interarbor Solutions, we create the analysis and in-depth podcasts on enterprise software and cloud trends that help fuel the social media revolution. As a veteran IT analyst, Dana Gardner moderates discussions and interviews that get to the meat of the hottest technology topics. We define and forecast the business productivity effects of enterprise infrastructure, SOA and cloud advances. Our social media vehicles become conversational platforms, powerfully distributed via the BriefingsDirect Network of online media partners like ZDNet and IT-Director.com. As founder and principal analyst at Interarbor Solutions, Dana Gardner created BriefingsDirect to give online readers and listeners in-depth and direct access to the brightest thought leaders on IT. Our twice-monthly BriefingsDirect Analyst Insights Edition podcasts examine the latest IT news with a panel of analysts and guests. Our sponsored discussions provide a unique, deep-dive focus on specific industry problems and the latest solutions. This podcast equivalent of an analyst briefing session -- made available as a podcast/transcript/blog to any interested viewer and search engine seeker -- breaks the mold on closed knowledge. These informational podcasts jump-start conversational evangelism, drive traffic to lead generation campaigns, and produce strong SEO returns. Interarbor Solutions provides fresh and creative thinking on IT, SOA, cloud and social media strategies based on the power of thoughtful content, made freely and easily available to proactive seekers of insights and information. As a result, marketers and branding professionals can communicate inexpensively with self-qualifying readers/listeners in discrete market segments. BriefingsDirect podcasts hosted by Dana Gardner: Full turnkey planning, moderating, producing, hosting, and distribution via blogs and IT media partners of essential IT knowledge and understanding.
