Java IoT Authors: Pat Romanski, Yeshim Deniz, Elizabeth White, Roger Strukhoff, Liz McMillan

Related Topics: @DXWorldExpo, Java IoT, @CloudExpo

@DXWorldExpo: Article

Big Data, Speed and Efficiency | @CloudExpo #BigData #MachineLearning

Here's how two part-time DBAs maintain mobile app ad platform Tapjoy’s massive data needs

The next BriefingsDirect Voice of the Customer big data case study discussion examines how mobile app advertising platform Tapjoy handles fast and massive data -- some two dozen terabytes per day -- with just two part-time database administrators (DBAs).

Examine how Tapjoy’s data-driven business of serving 500 million global mobile users -- or more than 1.5 million add engagements per day, a data volume of a 120 terabytes -- runs with extreme efficiency.

To learn more about how high scale and complexity meets minimal labor for building user and advertiser loyalty we're joined by David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Mobile advertising has really been a major growth area, perhaps more than any other type of advertising. We hear a lot about advertising waning, but not mobile app advertising. How does Tapjoy and its platform help contribute to the success of what we're seeing in the mobile app ad space?

Abercrombie: The key to Tapjoy’s success is engaging the users and rewarding them for engaging with an ad. Our advertising model is you engage with an ad and then you get typically some sort of reward: A virtual currency in the game you're playing or some sort of discount.


We actually have the kind of ads that lead users to seek us out to engage with the ads and get their rewards.

Gardner: So this is quite a bit different than a static presented ad. This is something that has a two-way street, maybe multiple directions of information coming and going. Why the analysis? Why is that so important? And why the speed of analysis?

Abercrombie: We have basically three types of customers. We have the app publishers who want to monetize and get money from displaying ads. We have the advertisers who need to get their message out and pay for that. Then, of course, we have the users who want to engage with the ads and get their rewards.

The key to Tapjoy’s success is being able to balance the needs of all of these disparate uses. We can’t charge the advertisers too much for their ads, even though the monetizers would like that. It’s a delicate balancing act, and that can only be done through big-data analysis, careful optimization, and careful monitoring of the ad network assets and operation.

Gardner: Before we learn more about the analytics, tell us a bit more about what role Tapjoy plays specifically in what looks like an ecosystem play for placing, evaluating, and monetizing app ads? What is it specifically that you do in this bigger app ad function?

Ad engagement model

Abercrombie: Specifically what Tapjoy does is enable this rewarded ad engagement model, so that the advertisers know that people are going to be paying attention to their ads and so that the publishers know that the ads we're displaying are compatible with their app and are not going to produce a jarring experience. We want everybody to be happy -- the publishers, the advertisers, and the users. That’s a delicate compromise that’s Tapjoy’s strength.

Gardner: And when you get an end user to do something, to take an action, that’s very powerful, not only because you're getting them to do what you wanted, but you can evaluate what they did under what circumstances and so forth. Tell us about the model of the end user specifically. What is it about engaging with them that leads to the data -- which we will get to in a moment?

Abercrombie: In our model of the user, we talk about long-term value. So even though it may be a new user who has just started with us, maybe their first engagement, we like to look at them in terms of their long-term value, both to the publishers and the advertiser.

We don’t want people who are just engaging with the ad and going away, getting what they want and not really caring about it. Rather, we want good users who will continue their engagement and continue this process. Once again, that takes some fairly sophisticated machine-learning algorithms and very powerful inferences to be able to assess the long-term value.

As an example, we have our publishers who are also advertisers. They're advertising their app within our platform and for them the conversion event, what they are looking for, is a download. What we're trying to do is to offer them users who will not only download the game once to get that initial payoff reward, but will value the download and continue to use it again and again.

The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising.

So all of our models are designed with that end in mind -- to look at the long-term value of the user, not just the immediate conversion at this instant in time.

Gardner: So perhaps it’s a bit of a misnomer to talk about ads in apps. We're really talking about a value-add function in the app itself.

Abercrombie: Right. The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising. If it’s another app, they want good users for whom that app is relevant and useful.

That’s really the way we look at it. That’s the way to enhance the overall experience in the long-term. We're not just in it for the short-term. We're looking at developing a good solid user base, a good set of users who engage thoroughly.

Gardner: And as I said in my set-up, there's nothing hotter in all of advertising than mobile apps and how to do this right. It’s early innings, but clearly the stakes are very high.

A tough business

Abercrombie: And it’s a tough business. People are saturated. Many people don’t want ads. Some of the business models are difficult to master.

For instance, there may be a sequence of multiple ad units. There may be a video followed by another ad to download something. It becomes a very tricky thing to balance the financing here. If it was just a simple pass-through and we take a cut, that would be trivial, but that doesn't work in today's market. There are more sophisticated approaches, which do involve business risk.

If we reward the user, based on the fact that they're watching the video, but then they don't download the app, then we don't get money. So we have to look very carefully at the complexity of the whole interaction to make it as smooth and rewarding as possible, so that the thing works. That's difficult to do.

Gardner: So we're in a dynamic, fast-growing, fairly fresh, new industry. Knowing what's going to happen before it happens is always fun in almost any industry, but in this case, it seems with those high stakes and to make that monetization happen, it’s particularly important.

Tell me now about gathering such large amounts of data, being able to work with it, and then allowing analysis to happen very swiftly. How do you go about making that possible?

Abercrombie: Our data architecture is relatively standard for this type of clickstream operation. There is some data that can be put directly into a transactional database in real time, but typically, that's only when you get to the very bottom of the funnel, the conversion stuff. But all that clickstream stuff gets written, has JSON formatted log files, gets swept up by a queuing system, and then put into our data systems.

Our legacy system involved a homegrown queuing system, dumping data into HDFS. From there, we would extract and load CSVs into Vertica. As with so many other organizations, we're moving to more real-time operations. Our queuing system has evolved from a couple of different homegrown applications, and now we're implementing Apache Kafka.

We use Spark as part of our infrastructure, as sort of a hub, if you will, where data is farmed out to other systems, including a real-time, in-memory SQL database, which is fairly new to us this year. Then, we're still putting data in HDFS, and that's where the machine learning occurs. From there, we're bringing it into Vertica.

In Vertica -- and our Vertica cluster has two main purposes -- there is the operational data store, which has the raw, flat tables that are one row for every event, with the millisecond timestamps and the IDs of all the different entities involved.

From that operational data store, we do a pure SQL ETL extract into kind of an old-school star schema within Vertica, the same database.

Pure SQL

So our business intelligence (BI) ETL is pure SQL and goes into a full-fledged snowflake schema, moderately denormalized with all the old-school bells and whistles, the type 1, type 2, slowly changing dimensions. With Vertica, we're able to denormalize that data warehouse to a large degree.

Sitting on top of that we have a BI tool. We use MicroStrategy, for which we have defined our various metrics and our various attributes, and it’s very adept at knowing exactly which fact table and which dimensions to join.

So we have sort of a hybrid architecture. I'd say that we have all the way from real-time, in-memory SQL, Hadoop and all of its machine learning and our algorithmic pipelines, and then we have kind of the old-school data warehouse with the operational data store and the star schema.

Gardner: So a complex, innovative, custom architectural approach to this and yet I'm astonished that you are running and using Vertica in multiple ways with two part-time DBAs. How is it possible that you have minimal labor, given this topology that you just described?

Abercrombie: Well, we found Vertica very easy to manage. It has been very well-behaved, very stable.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

For instance, we don’t even really use the Management Console, because there is not enough to manage. Our cluster is about 120 terabytes. It’s only on eight nodes and it’s pretty much trouble free.

One of the part-times DBAs deals with kind of more operating-system level stuff --  patches, cluster recovery, those sorts of issues. And the other part-time DBA is me. I deal more with data structure design, SQL tuning and Vertica training for our staff.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

When we first started out, we tried running Vertica in Amazon EC2. Mind you, this was four or five years ago. Amazon EC2 was not where it is today. It failed. It was very difficult to manage. There were perplexing problems that we couldn’t solve. So we moved our Vertica and essentially all of our big-data data systems out of the cloud onto dedicated hardware, where they are much easier to manage and much easier to bring the proper resources.

Then, at one time in our history, when we built a dedicated hardware cluster for Vertica, we failed to heed properly the hardware planning guide and did not provision enough disk I/O bandwidth. In those situations, Vertica is unstable, and we had a lot of problems.

But once we got the proper disk I/O, it has been smooth sailing. I can’t even remember the last time we even had a node drop out. It has been rock solid. I was able to go on a vacation for three weeks recently and know that there would be no problem, and there was no problem.

Gardner: The ultimate key performance indicator (KPI), "I was able to go on vacation."

Fairly resilient

Abercrombie: Exactly. And with the proper hardware design, HPE Vertica is fairly resilient against out-of-control queries. There was a time when half my time was spent monitoring for slow queries, but again, with the proper hardware, it's smooth sailing. I don’t even bother with that stuff anymore.

Our MicroStrategy BI tool writes very good SQL. Part of the key to our success with this BI portion is designing the Vertica schema and the MicroStrategy metadata layer to take advantage of each other’s strengths and avoid each other’s weaknesses. So that really was key to the stable, exceptional performance we get. I basically get no complaints of slow queries from my BI tool. No problem.

Gardner: The right kind of problem to have.

Abercrombie: Yes.

Gardner: Okay, now that we have heard quite a bit about how you are doing this, I'd like to learn, if I could, about some of the paybacks when you do this properly, when it is running well, in terms of SQL queries, ETL load times reduction, the ability for you to monetize and help your customers create better advertising programs that are acceptable and popular. What are the paybacks technically and then in business terms?

The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.

Abercrombie: In order to get those paybacks, a key element was confidence in the data, the results that we were shipping out. The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.

What that also means is that as a product is under development and when it’s not ready yet, the instrumentation isn’t ready, that stuff doesn’t make it into our BI tool. You can only get that stuff from ad hoc.

So the benefit has been a very clear understanding of the day-to-day operations of our ad network, both for our internal monitoring to know when things are behaving properly, when the instrumentation is working as expected, and when the queues are running, but also for our customers.

Because of the flexibility that we can do from a traditional BI system with 500 metrics, over a couple of dozen dimensions, our customers, the publishers and the advertisers, get incredible detail, customized exactly the way they need for ingestion into their systems or to help them understand how Tapjoy is serving them. Again, that comes from confidence in the data.

Gardner: When you have more data and better analytics, you can create better products. Where might we look next to where you take this? I don’t expect you to pre-announce anything, but where can you now take these capabilities as a business and maybe even expand into other activities on a mobile endpoint?

Flexibility in algorithms

Abercrombie: As we expand our business and move into new areas, what we really need is flexibility in our algorithms and the way we deal with some of our real-time decision making.

So one area that’s new to us this year is the in-memory SQL database like MemSQL. Some of our old real-time ad optimization was based on pre-calculating data and serving it up through HBase KeyValue, but now, where we can do real-time aggregation queries using SQL, that is easy to understand, easy to modify, very expressive and very transparent. It gives us more flexibility in terms of fine-tuning our real-time decision-making algorithms, which is absolutely necessary.

As an example, we acquired a company in Korea called 5Rocks that does app tech and that tracks the users within the app, like what level they're on, or what activities they're doing and what they enjoy, with an eye towards in-app purchase optimization.

And so we're blending the in-app purchase optimization along with traditional ad network optimization, and the two have different rules and different constraints. So we really need the flexibility and expressiveness of our real-time decision making systems.

Gardner: One last question. You mentioned machine learning earlier. Do you see that becoming more prominent in what you do and how you're working with data scientists, and how might that expand in terms of where you employ it?

Abercrombie: Tapjoy started with machine learning. Our data scientists are machine learning. Our productive algorithm team is about six times larger than our traditional Vertica BI team. Mostly what we do at Tapjoy is predictive analytics and various machine-learning things. So we wouldn't be alive without it. And we expanded. We're not shifting in one direction or another. It's apples and oranges, and there's a place for both.

Listen to the podcast. Find it on iTunes. Get the mobile app. Read a full transcript or download a copy. Sponsor: Hewlett Packard Enterprise.

You may also be interested in:

More Stories By Dana Gardner

At Interarbor Solutions, we create the analysis and in-depth podcasts on enterprise software and cloud trends that help fuel the social media revolution. As a veteran IT analyst, Dana Gardner moderates discussions and interviews get to the meat of the hottest technology topics. We define and forecast the business productivity effects of enterprise infrastructure, SOA and cloud advances. Our social media vehicles become conversational platforms, powerfully distributed via the BriefingsDirect Network of online media partners like ZDNet and IT-Director.com. As founder and principal analyst at Interarbor Solutions, Dana Gardner created BriefingsDirect to give online readers and listeners in-depth and direct access to the brightest thought leaders on IT. Our twice-monthly BriefingsDirect Analyst Insights Edition podcasts examine the latest IT news with a panel of analysts and guests. Our sponsored discussions provide a unique, deep-dive focus on specific industry problems and the latest solutions. This podcast equivalent of an analyst briefing session -- made available as a podcast/transcript/blog to any interested viewer and search engine seeker -- breaks the mold on closed knowledge. These informational podcasts jump-start conversational evangelism, drive traffic to lead generation campaigns, and produce strong SEO returns. Interarbor Solutions provides fresh and creative thinking on IT, SOA, cloud and social media strategies based on the power of thoughtful content, made freely and easily available to proactive seekers of insights and information. As a result, marketers and branding professionals can communicate inexpensively with self-qualifiying readers/listeners in discreet market segments. BriefingsDirect podcasts hosted by Dana Gardner: Full turnkey planning, moderatiing, producing, hosting, and distribution via blogs and IT media partners of essential IT knowledge and understanding.

@ThingsExpo Stories
Enterprises have taken advantage of IoT to achieve important revenue and cost advantages. What is less apparent is how incumbent enterprises operating at scale have, following success with IoT, built analytic, operations management and software development capabilities - ranging from autonomous vehicles to manageable robotics installations. They have embraced these capabilities as if they were Silicon Valley startups.
The standardization of container runtimes and images has sparked the creation of an almost overwhelming number of new open source projects that build on and otherwise work with these specifications. Of course, there's Kubernetes, which orchestrates and manages collections of containers. It was one of the first and best-known examples of projects that make containers truly useful for production use. However, more recently, the container ecosystem has truly exploded. A service mesh like Istio addr...
Predicting the future has never been more challenging - not because of the lack of data but because of the flood of ungoverned and risk laden information. Microsoft states that 2.5 exabytes of data are created every day. Expectations and reliance on data are being pushed to the limits, as demands around hybrid options continue to grow.
Poor data quality and analytics drive down business value. In fact, Gartner estimated that the average financial impact of poor data quality on organizations is $9.7 million per year. But bad data is much more than a cost center. By eroding trust in information, analytics and the business decisions based on these, it is a serious impediment to digital transformation.
Business professionals no longer wonder if they'll migrate to the cloud; it's now a matter of when. The cloud environment has proved to be a major force in transitioning to an agile business model that enables quick decisions and fast implementation that solidify customer relationships. And when the cloud is combined with the power of cognitive computing, it drives innovation and transformation that achieves astounding competitive advantage.
As IoT continues to increase momentum, so does the associated risk. Secure Device Lifecycle Management (DLM) is ranked as one of the most important technology areas of IoT. Driving this trend is the realization that secure support for IoT devices provides companies the ability to deliver high-quality, reliable, secure offerings faster, create new revenue streams, and reduce support costs, all while building a competitive advantage in their markets. In this session, we will use customer use cases...
Digital Transformation: Preparing Cloud & IoT Security for the Age of Artificial Intelligence. As automation and artificial intelligence (AI) power solution development and delivery, many businesses need to build backend cloud capabilities. Well-poised organizations, marketing smart devices with AI and BlockChain capabilities prepare to refine compliance and regulatory capabilities in 2018. Volumes of health, financial, technical and privacy data, along with tightening compliance requirements by...
Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life settlement products to hedge funds and investment banks. After, he co-founded a revenue cycle management company where he learned about Bitcoin and eventually Ethereal. Andrew's role at ConsenSys Enterprise is a mul...
DXWorldEXPO LLC announced today that "Miami Blockchain Event by FinTechEXPO" has announced that its Call for Papers is now open. The two-day event will present 20 top Blockchain experts. All speaking inquiries which covers the following information can be submitted by email to [email protected] Financial enterprises in New York City, London, Singapore, and other world financial capitals are embracing a new generation of smart, automated FinTech that eliminates many cumbersome, slow, and expe...
DXWorldEXPO | CloudEXPO are the world's most influential, independent events where Cloud Computing was coined and where technology buyers and vendors meet to experience and discuss the big picture of Digital Transformation and all of the strategies, tactics, and tools they need to realize their goals. Sponsors of DXWorldEXPO | CloudEXPO benefit from unmatched branding, profile building and lead generation opportunities.
The best way to leverage your Cloud Expo presence as a sponsor and exhibitor is to plan your news announcements around our events. The press covering Cloud Expo and @ThingsExpo will have access to these releases and will amplify your news announcements. More than two dozen Cloud companies either set deals at our shows or have announced their mergers and acquisitions at Cloud Expo. Product announcements during our show provide your company with the most reach through our targeted audiences.
DevOpsSummit New York 2018, colocated with CloudEXPO | DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City. Digital Transformation (DX) is a major focus with the introduction of DXWorldEXPO within the program. Successful transformation requires a laser focus on being data-driven and on using all the tools available that enable transformation if they plan to survive over the long term. A total of 88% of Fortune 500 companies from a generation ago are now out of bus...
With 10 simultaneous tracks, keynotes, general sessions and targeted breakout classes, @CloudEXPO and DXWorldEXPO are two of the most important technology events of the year. Since its launch over eight years ago, @CloudEXPO and DXWorldEXPO have presented a rock star faculty as well as showcased hundreds of sponsors and exhibitors! In this blog post, we provide 7 tips on how, as part of our world-class faculty, you can deliver one of the most popular sessions at our events. But before reading...
Cloud Expo | DXWorld Expo have announced the conference tracks for Cloud Expo 2018. Cloud Expo will be held June 5-7, 2018, at the Javits Center in New York City, and November 6-8, 2018, at the Santa Clara Convention Center, Santa Clara, CA. Digital Transformation (DX) is a major focus with the introduction of DX Expo within the program. Successful transformation requires a laser focus on being data-driven and on using all the tools available that enable transformation if they plan to survive ov...
DXWordEXPO New York 2018, colocated with CloudEXPO New York 2018 will be held November 11-13, 2018, in New York City and will bring together Cloud Computing, FinTech and Blockchain, Digital Transformation, Big Data, Internet of Things, DevOps, AI, Machine Learning and WebRTC to one location.
DXWorldEXPO LLC announced today that ICOHOLDER named "Media Sponsor" of Miami Blockchain Event by FinTechEXPO. ICOHOLDER give you detailed information and help the community to invest in the trusty projects. Miami Blockchain Event by FinTechEXPO has opened its Call for Papers. The two-day event will present 20 top Blockchain experts. All speaking inquiries which covers the following information can be submitted by email to [email protected] Miami Blockchain Event by FinTechEXPO also offers s...
Dion Hinchcliffe is an internationally recognized digital expert, bestselling book author, frequent keynote speaker, analyst, futurist, and transformation expert based in Washington, DC. He is currently Chief Strategy Officer at the industry-leading digital strategy and online community solutions firm, 7Summits.
Digital Transformation and Disruption, Amazon Style - What You Can Learn. Chris Kocher is a co-founder of Grey Heron, a management and strategic marketing consulting firm. He has 25+ years in both strategic and hands-on operating experience helping executives and investors build revenues and shareholder value. He has consulted with over 130 companies on innovating with new business models, product strategies and monetization. Chris has held management positions at HP and Symantec in addition to ...
Cloud-enabled transformation has evolved from cost saving measure to business innovation strategy -- one that combines the cloud with cognitive capabilities to drive market disruption. Learn how you can achieve the insight and agility you need to gain a competitive advantage. Industry-acclaimed CTO and cloud expert, Shankar Kalyana presents. Only the most exceptional IBMers are appointed with the rare distinction of IBM Fellow, the highest technical honor in the company. Shankar has also receive...
With tough new regulations coming to Europe on data privacy in May 2018, Calligo will explain why in reality the effect is global and transforms how you consider critical data. EU GDPR fundamentally rewrites the rules for cloud, Big Data and IoT. In his session at 21st Cloud Expo, Adam Ryan, Vice President and General Manager EMEA at Calligo, examined the regulations and provided insight on how it affects technology, challenges the established rules and will usher in new levels of diligence arou...