Welcome!

Java IoT Authors: Liz McMillan, Elizabeth White, Pat Romanski, Yeshim Deniz, Frank Lupo

Related Topics: @BigDataExpo, Java IoT, @CloudExpo

@BigDataExpo: Article

Big Data, Speed and Efficiency | @CloudExpo #BigData #MachineLearning

Here's how two part-time DBAs maintain mobile app ad platform Tapjoy’s massive data needs

The next BriefingsDirect Voice of the Customer big data case study discussion examines how mobile app advertising platform Tapjoy handles fast and massive data -- some two dozen terabytes per day -- with just two part-time database administrators (DBAs).

Examine how Tapjoy’s data-driven business of serving 500 million global mobile users -- or more than 1.5 million add engagements per day, a data volume of a 120 terabytes -- runs with extreme efficiency.

To learn more about how high scale and complexity meets minimal labor for building user and advertiser loyalty we're joined by David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Mobile advertising has really been a major growth area, perhaps more than any other type of advertising. We hear a lot about advertising waning, but not mobile app advertising. How does Tapjoy and its platform help contribute to the success of what we're seeing in the mobile app ad space?

Abercrombie: The key to Tapjoy’s success is engaging the users and rewarding them for engaging with an ad. Our advertising model is you engage with an ad and then you get typically some sort of reward: A virtual currency in the game you're playing or some sort of discount.

Abercrombie

We actually have the kind of ads that lead users to seek us out to engage with the ads and get their rewards.

Gardner: So this is quite a bit different than a static presented ad. This is something that has a two-way street, maybe multiple directions of information coming and going. Why the analysis? Why is that so important? And why the speed of analysis?

Abercrombie: We have basically three types of customers. We have the app publishers who want to monetize and get money from displaying ads. We have the advertisers who need to get their message out and pay for that. Then, of course, we have the users who want to engage with the ads and get their rewards.

The key to Tapjoy’s success is being able to balance the needs of all of these disparate uses. We can’t charge the advertisers too much for their ads, even though the monetizers would like that. It’s a delicate balancing act, and that can only be done through big-data analysis, careful optimization, and careful monitoring of the ad network assets and operation.

Gardner: Before we learn more about the analytics, tell us a bit more about what role Tapjoy plays specifically in what looks like an ecosystem play for placing, evaluating, and monetizing app ads? What is it specifically that you do in this bigger app ad function?

Ad engagement model

Abercrombie: Specifically what Tapjoy does is enable this rewarded ad engagement model, so that the advertisers know that people are going to be paying attention to their ads and so that the publishers know that the ads we're displaying are compatible with their app and are not going to produce a jarring experience. We want everybody to be happy -- the publishers, the advertisers, and the users. That’s a delicate compromise that’s Tapjoy’s strength.

Gardner: And when you get an end user to do something, to take an action, that’s very powerful, not only because you're getting them to do what you wanted, but you can evaluate what they did under what circumstances and so forth. Tell us about the model of the end user specifically. What is it about engaging with them that leads to the data -- which we will get to in a moment?

Abercrombie: In our model of the user, we talk about long-term value. So even though it may be a new user who has just started with us, maybe their first engagement, we like to look at them in terms of their long-term value, both to the publishers and the advertiser.

We don’t want people who are just engaging with the ad and going away, getting what they want and not really caring about it. Rather, we want good users who will continue their engagement and continue this process. Once again, that takes some fairly sophisticated machine-learning algorithms and very powerful inferences to be able to assess the long-term value.

As an example, we have our publishers who are also advertisers. They're advertising their app within our platform and for them the conversion event, what they are looking for, is a download. What we're trying to do is to offer them users who will not only download the game once to get that initial payoff reward, but will value the download and continue to use it again and again.

The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising.

So all of our models are designed with that end in mind -- to look at the long-term value of the user, not just the immediate conversion at this instant in time.

Gardner: So perhaps it’s a bit of a misnomer to talk about ads in apps. We're really talking about a value-add function in the app itself.

Abercrombie: Right. The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising. If it’s another app, they want good users for whom that app is relevant and useful.

That’s really the way we look at it. That’s the way to enhance the overall experience in the long-term. We're not just in it for the short-term. We're looking at developing a good solid user base, a good set of users who engage thoroughly.

Gardner: And as I said in my set-up, there's nothing hotter in all of advertising than mobile apps and how to do this right. It’s early innings, but clearly the stakes are very high.

A tough business

Abercrombie: And it’s a tough business. People are saturated. Many people don’t want ads. Some of the business models are difficult to master.

For instance, there may be a sequence of multiple ad units. There may be a video followed by another ad to download something. It becomes a very tricky thing to balance the financing here. If it was just a simple pass-through and we take a cut, that would be trivial, but that doesn't work in today's market. There are more sophisticated approaches, which do involve business risk.

If we reward the user, based on the fact that they're watching the video, but then they don't download the app, then we don't get money. So we have to look very carefully at the complexity of the whole interaction to make it as smooth and rewarding as possible, so that the thing works. That's difficult to do.

Gardner: So we're in a dynamic, fast-growing, fairly fresh, new industry. Knowing what's going to happen before it happens is always fun in almost any industry, but in this case, it seems with those high stakes and to make that monetization happen, it’s particularly important.

Tell me now about gathering such large amounts of data, being able to work with it, and then allowing analysis to happen very swiftly. How do you go about making that possible?

Abercrombie: Our data architecture is relatively standard for this type of clickstream operation. There is some data that can be put directly into a transactional database in real time, but typically, that's only when you get to the very bottom of the funnel, the conversion stuff. But all that clickstream stuff gets written, has JSON formatted log files, gets swept up by a queuing system, and then put into our data systems.

Our legacy system involved a homegrown queuing system, dumping data into HDFS. From there, we would extract and load CSVs into Vertica. As with so many other organizations, we're moving to more real-time operations. Our queuing system has evolved from a couple of different homegrown applications, and now we're implementing Apache Kafka.

We use Spark as part of our infrastructure, as sort of a hub, if you will, where data is farmed out to other systems, including a real-time, in-memory SQL database, which is fairly new to us this year. Then, we're still putting data in HDFS, and that's where the machine learning occurs. From there, we're bringing it into Vertica.

In Vertica -- and our Vertica cluster has two main purposes -- there is the operational data store, which has the raw, flat tables that are one row for every event, with the millisecond timestamps and the IDs of all the different entities involved.

From that operational data store, we do a pure SQL ETL extract into kind of an old-school star schema within Vertica, the same database.

Pure SQL

So our business intelligence (BI) ETL is pure SQL and goes into a full-fledged snowflake schema, moderately denormalized with all the old-school bells and whistles, the type 1, type 2, slowly changing dimensions. With Vertica, we're able to denormalize that data warehouse to a large degree.

Sitting on top of that we have a BI tool. We use MicroStrategy, for which we have defined our various metrics and our various attributes, and it’s very adept at knowing exactly which fact table and which dimensions to join.

So we have sort of a hybrid architecture. I'd say that we have all the way from real-time, in-memory SQL, Hadoop and all of its machine learning and our algorithmic pipelines, and then we have kind of the old-school data warehouse with the operational data store and the star schema.

Gardner: So a complex, innovative, custom architectural approach to this and yet I'm astonished that you are running and using Vertica in multiple ways with two part-time DBAs. How is it possible that you have minimal labor, given this topology that you just described?

Abercrombie: Well, we found Vertica very easy to manage. It has been very well-behaved, very stable.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

For instance, we don’t even really use the Management Console, because there is not enough to manage. Our cluster is about 120 terabytes. It’s only on eight nodes and it’s pretty much trouble free.

One of the part-times DBAs deals with kind of more operating-system level stuff --  patches, cluster recovery, those sorts of issues. And the other part-time DBA is me. I deal more with data structure design, SQL tuning and Vertica training for our staff.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

When we first started out, we tried running Vertica in Amazon EC2. Mind you, this was four or five years ago. Amazon EC2 was not where it is today. It failed. It was very difficult to manage. There were perplexing problems that we couldn’t solve. So we moved our Vertica and essentially all of our big-data data systems out of the cloud onto dedicated hardware, where they are much easier to manage and much easier to bring the proper resources.

Then, at one time in our history, when we built a dedicated hardware cluster for Vertica, we failed to heed properly the hardware planning guide and did not provision enough disk I/O bandwidth. In those situations, Vertica is unstable, and we had a lot of problems.

But once we got the proper disk I/O, it has been smooth sailing. I can’t even remember the last time we even had a node drop out. It has been rock solid. I was able to go on a vacation for three weeks recently and know that there would be no problem, and there was no problem.

Gardner: The ultimate key performance indicator (KPI), "I was able to go on vacation."

Fairly resilient

Abercrombie: Exactly. And with the proper hardware design, HPE Vertica is fairly resilient against out-of-control queries. There was a time when half my time was spent monitoring for slow queries, but again, with the proper hardware, it's smooth sailing. I don’t even bother with that stuff anymore.

Our MicroStrategy BI tool writes very good SQL. Part of the key to our success with this BI portion is designing the Vertica schema and the MicroStrategy metadata layer to take advantage of each other’s strengths and avoid each other’s weaknesses. So that really was key to the stable, exceptional performance we get. I basically get no complaints of slow queries from my BI tool. No problem.

Gardner: The right kind of problem to have.

Abercrombie: Yes.

Gardner: Okay, now that we have heard quite a bit about how you are doing this, I'd like to learn, if I could, about some of the paybacks when you do this properly, when it is running well, in terms of SQL queries, ETL load times reduction, the ability for you to monetize and help your customers create better advertising programs that are acceptable and popular. What are the paybacks technically and then in business terms?

The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.

Abercrombie: In order to get those paybacks, a key element was confidence in the data, the results that we were shipping out. The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.

What that also means is that as a product is under development and when it’s not ready yet, the instrumentation isn’t ready, that stuff doesn’t make it into our BI tool. You can only get that stuff from ad hoc.

So the benefit has been a very clear understanding of the day-to-day operations of our ad network, both for our internal monitoring to know when things are behaving properly, when the instrumentation is working as expected, and when the queues are running, but also for our customers.

Because of the flexibility that we can do from a traditional BI system with 500 metrics, over a couple of dozen dimensions, our customers, the publishers and the advertisers, get incredible detail, customized exactly the way they need for ingestion into their systems or to help them understand how Tapjoy is serving them. Again, that comes from confidence in the data.

Gardner: When you have more data and better analytics, you can create better products. Where might we look next to where you take this? I don’t expect you to pre-announce anything, but where can you now take these capabilities as a business and maybe even expand into other activities on a mobile endpoint?

Flexibility in algorithms

Abercrombie: As we expand our business and move into new areas, what we really need is flexibility in our algorithms and the way we deal with some of our real-time decision making.

So one area that’s new to us this year is the in-memory SQL database like MemSQL. Some of our old real-time ad optimization was based on pre-calculating data and serving it up through HBase KeyValue, but now, where we can do real-time aggregation queries using SQL, that is easy to understand, easy to modify, very expressive and very transparent. It gives us more flexibility in terms of fine-tuning our real-time decision-making algorithms, which is absolutely necessary.

As an example, we acquired a company in Korea called 5Rocks that does app tech and that tracks the users within the app, like what level they're on, or what activities they're doing and what they enjoy, with an eye towards in-app purchase optimization.

And so we're blending the in-app purchase optimization along with traditional ad network optimization, and the two have different rules and different constraints. So we really need the flexibility and expressiveness of our real-time decision making systems.

Gardner: One last question. You mentioned machine learning earlier. Do you see that becoming more prominent in what you do and how you're working with data scientists, and how might that expand in terms of where you employ it?

Abercrombie: Tapjoy started with machine learning. Our data scientists are machine learning. Our productive algorithm team is about six times larger than our traditional Vertica BI team. Mostly what we do at Tapjoy is predictive analytics and various machine-learning things. So we wouldn't be alive without it. And we expanded. We're not shifting in one direction or another. It's apples and oranges, and there's a place for both.

Listen to the podcast. Find it on iTunes. Get the mobile app. Read a full transcript or download a copy. Sponsor: Hewlett Packard Enterprise.

You may also be interested in:

More Stories By Dana Gardner

At Interarbor Solutions, we create the analysis and in-depth podcasts on enterprise software and cloud trends that help fuel the social media revolution. As a veteran IT analyst, Dana Gardner moderates discussions and interviews get to the meat of the hottest technology topics. We define and forecast the business productivity effects of enterprise infrastructure, SOA and cloud advances. Our social media vehicles become conversational platforms, powerfully distributed via the BriefingsDirect Network of online media partners like ZDNet and IT-Director.com. As founder and principal analyst at Interarbor Solutions, Dana Gardner created BriefingsDirect to give online readers and listeners in-depth and direct access to the brightest thought leaders on IT. Our twice-monthly BriefingsDirect Analyst Insights Edition podcasts examine the latest IT news with a panel of analysts and guests. Our sponsored discussions provide a unique, deep-dive focus on specific industry problems and the latest solutions. This podcast equivalent of an analyst briefing session -- made available as a podcast/transcript/blog to any interested viewer and search engine seeker -- breaks the mold on closed knowledge. These informational podcasts jump-start conversational evangelism, drive traffic to lead generation campaigns, and produce strong SEO returns. Interarbor Solutions provides fresh and creative thinking on IT, SOA, cloud and social media strategies based on the power of thoughtful content, made freely and easily available to proactive seekers of insights and information. As a result, marketers and branding professionals can communicate inexpensively with self-qualifiying readers/listeners in discreet market segments. BriefingsDirect podcasts hosted by Dana Gardner: Full turnkey planning, moderatiing, producing, hosting, and distribution via blogs and IT media partners of essential IT knowledge and understanding.

@ThingsExpo Stories
As businesses evolve, they need technology that is simple to help them succeed today and flexible enough to help them build for tomorrow. Chrome is fit for the workplace of the future — providing a secure, consistent user experience across a range of devices that can be used anywhere. In her session at 21st Cloud Expo, Vidya Nagarajan, a Senior Product Manager at Google, will take a look at various options as to how ChromeOS can be leveraged to interact with people on the devices, and formats th...
SYS-CON Events announced today that Yuasa System will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Yuasa System is introducing a multi-purpose endurance testing system for flexible displays, OLED devices, flexible substrates, flat cables, and films in smartphones, wearables, automobiles, and healthcare.
SYS-CON Events announced today that Taica will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Taica manufacturers Alpha-GEL brand silicone components and materials, which maintain outstanding performance over a wide temperature range -40C to +200C. For more information, visit http://www.taica.co.jp/english/.
SYS-CON Events announced today that SourceForge has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. SourceForge is the largest, most trusted destination for Open Source Software development, collaboration, discovery and download on the web serving over 32 million viewers, 150 million downloads and over 460,000 active development projects each and every month.
SYS-CON Events announced today that Nihon Micron will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Nihon Micron Co., Ltd. strives for technological innovation to establish high-density, high-precision processing technology for providing printed circuit board and metal mount RFID tags used for communication devices. For more inf...
Enterprises have taken advantage of IoT to achieve important revenue and cost advantages. What is less apparent is how incumbent enterprises operating at scale have, following success with IoT, built analytic, operations management and software development capabilities – ranging from autonomous vehicles to manageable robotics installations. They have embraced these capabilities as if they were Silicon Valley startups. As a result, many firms employ new business models that place enormous impor...
SYS-CON Events announced today that MIRAI Inc. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MIRAI Inc. are IT consultants from the public sector whose mission is to solve social issues by technology and innovation and to create a meaningful future for people.
Widespread fragmentation is stalling the growth of the IIoT and making it difficult for partners to work together. The number of software platforms, apps, hardware and connectivity standards is creating paralysis among businesses that are afraid of being locked into a solution. EdgeX Foundry is unifying the community around a common IoT edge framework and an ecosystem of interoperable components.
SYS-CON Events announced today that Dasher Technologies will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Dasher Technologies, Inc. ® is a premier IT solution provider that delivers expert technical resources along with trusted account executives to architect and deliver complete IT solutions and services to help our clients execute their goals, plans and objectives. Since 1999, we'v...
SYS-CON Events announced today that TidalScale, a leading provider of systems and services, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. TidalScale has been involved in shaping the computing landscape. They've designed, developed and deployed some of the most important and successful systems and services in the history of the computing industry - internet, Ethernet, operating s...
SYS-CON Events announced today that Massive Networks, that helps your business operate seamlessly with fast, reliable, and secure internet and network solutions, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. As a premier telecommunications provider, Massive Networks is headquartered out of Louisville, Colorado. With years of experience under their belt, their team of...
SYS-CON Events announced today that IBM has been named “Diamond Sponsor” of SYS-CON's 21st Cloud Expo, which will take place on October 31 through November 2nd 2017 at the Santa Clara Convention Center in Santa Clara, California.
Infoblox delivers Actionable Network Intelligence to enterprise, government, and service provider customers around the world. They are the industry leader in DNS, DHCP, and IP address management, the category known as DDI. We empower thousands of organizations to control and secure their networks from the core-enabling them to increase efficiency and visibility, improve customer service, and meet compliance requirements.
SYS-CON Events announced today that TidalScale will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. TidalScale is the leading provider of Software-Defined Servers that bring flexibility to modern data centers by right-sizing servers on the fly to fit any data set or workload. TidalScale’s award-winning inverse hypervisor technology combines multiple commodity servers (including their ass...
As hybrid cloud becomes the de-facto standard mode of operation for most enterprises, new challenges arise on how to efficiently and economically share data across environments. In his session at 21st Cloud Expo, Dr. Allon Cohen, VP of Product at Elastifile, will explore new techniques and best practices that help enterprise IT benefit from the advantages of hybrid cloud environments by enabling data availability for both legacy enterprise and cloud-native mission critical applications. By rev...
As popularity of the smart home is growing and continues to go mainstream, technological factors play a greater role. The IoT protocol houses the interoperability battery consumption, security, and configuration of a smart home device, and it can be difficult for companies to choose the right kind for their product. For both DIY and professionally installed smart homes, developers need to consider each of these elements for their product to be successful in the market and current smart homes.
Join IBM November 1 at 21st Cloud Expo at the Santa Clara Convention Center in Santa Clara, CA, and learn how IBM Watson can bring cognitive services and AI to intelligent, unmanned systems. Cognitive analysis impacts today’s systems with unparalleled ability that were previously available only to manned, back-end operations. Thanks to cloud processing, IBM Watson can bring cognitive services and AI to intelligent, unmanned systems. Imagine a robot vacuum that becomes your personal assistant tha...
In his Opening Keynote at 21st Cloud Expo, John Considine, General Manager of IBM Cloud Infrastructure, will lead you through the exciting evolution of the cloud. He'll look at this major disruption from the perspective of technology, business models, and what this means for enterprises of all sizes. John Considine is General Manager of Cloud Infrastructure Services at IBM. In that role he is responsible for leading IBM’s public cloud infrastructure including strategy, development, and offering ...
SYS-CON Events announced today that N3N will exhibit at SYS-CON's @ThingsExpo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. N3N’s solutions increase the effectiveness of operations and control centers, increase the value of IoT investments, and facilitate real-time operational decision making. N3N enables operations teams with a four dimensional digital “big board” that consolidates real-time live video feeds alongside IoT sensor data a...
In a recent survey, Sumo Logic surveyed 1,500 customers who employ cloud services such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). According to the survey, a quarter of the respondents have already deployed Docker containers and nearly as many (23 percent) are employing the AWS Lambda serverless computing framework. It’s clear: serverless is here to stay. The adoption does come with some needed changes, within both application development and operations. Tha...