The cloud paradigm is the driving force behind the digital transformation taking place in businesses at every latitude. Every organization is now an ICT organization, and every moment wasted on doubt brings a business one step closer to closing.
Cloud-native development is the new normal, and its coexistence with legacy systems conceals the true technical debt your company must repay if it really wants to compete with best-in-class companies.
GCP, the complete environment
An up-to-date pipeline for software development goes hand in hand with a sound pipeline for generating and managing clean data. Google puts all its experience and support behind the Google Cloud Platform:
- data technologies such as BigQuery, Vertex AI (the successor to older tools such as Datalab), and Bigtable
- development tools such as App Engine and Pub/Sub
- platform and infrastructure virtualization with Compute Engine and Kubernetes Engine
and many more technologies that will help your business build the bridge to a truly modern, Cloud-native, microservice-based approach.
Google Cloud Platform is the reference platform for migrating a business towards a Cloud-native approach that will carry the company into the future.
To better understand the process, let’s look at two case studies of companies that moved their data, business intelligence, and digital transformation to the cloud, starting with BigQuery, the reference service for streaming analytics.
BigQuery as the foundation for streaming analytics
One point that differentiates today’s modern architectures from legacy options is the underlying streaming architecture. Optimizing large-scale analytics ingestion on Google Cloud is a primary task in most projects.
In this article, ‘large-scale’ means more than 100,000 events per second, or a total aggregate event payload of over 100 MB per second.
The first step is collecting vast amounts of incoming log and analytics events (Google Cloud’s managed services fit this task perfectly) and then processing them for entry into a data warehouse, such as BigQuery.
Any architecture that allows ingestion of significant quantities of analytics data should consider which data need to be accessed in near real-time and which can be dealt with after a short delay, and split the data accordingly.
A segmented approach has three main benefits:
- log integrity
- cost reduction
- reserved query resources
Moving lower-priority logs to batch loading prevents them from impacting reserved query resources. You can see the reference architecture in the diagram below.
In this architecture, data originates from two possible sources: analytics events (published to a Pub/Sub topic) and logs (collected using Cloud Logging).
Analytics events can be generated by your app’s services in Google Cloud or sent from remote clients. Ingesting these analytics events through Pub/Sub and then processing them in Dataflow provides a high-throughput system with low latency. The concepts of hot paths and cold paths are very important here.
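Hot and cold paths are covered next; as a concrete starting point, here is a minimal sketch of how a backend service or client proxy might publish an analytics event to such a Pub/Sub topic. The project and topic names are hypothetical.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "analytics-events")

event = {"user_id": "1234", "action": "app_open", "ts": "2021-06-01T12:00:00Z"}

# Pub/Sub carries raw bytes, so the event is serialized as JSON before publishing.
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("Published message", future.result())
```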
Hot path
Some events need immediate analysis. For example, an event might indicate undesirable client behavior or bad actors. Cherry-pick such events from Pub/Sub by using an autoscaling Dataflow job and then send them directly to BigQuery. The Dataflow job can partition this data to ensure that the 100,000 rows per second limit per table is not reached. This also keeps queries performing well.
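A hot-path pipeline of this shape could be sketched with the Apache Beam SDK, which Dataflow executes. The topic, table, and schema below are hypothetical, and a real job would add parsing, validation, and partitioning logic.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub sources require streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read analytics events from the (hypothetical) Pub/Sub topic.
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/analytics-events")
        | "ParseJson" >> beam.Map(lambda data: json.loads(data.decode("utf-8")))
        # Stream rows straight into BigQuery for near real-time queries.
        | "StreamToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.events",
            schema="user_id:STRING,action:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```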
Cold path
Events that need to be tracked and analyzed on an hourly or daily basis, but never immediately, can be pushed by Dataflow to objects on Cloud Storage. Loads from Cloud Storage into BigQuery can be initiated using the Cloud Console, the bq command-line tool, or even a simple script, and the data can be merged into the same tables as the hot-path events.
Like the logging cold path, batch-loaded analytics events do not impact reserved query resources, and keep the streaming ingest path load reasonable.
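For illustration, such a cold-path batch load from Cloud Storage can be scripted with the BigQuery Python client; the bucket, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,  # let BigQuery infer the schema for this sketch
)

# Batch loads like this do not consume streaming quota or reserved query resources.
load_job = client.load_table_from_uri(
    "gs://my-analytics-bucket/cold-path/*.json",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # wait for the batch load to finish
```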
Some BigQuery features have been very important in the two case studies that you are going to read: Bending Spoons and Jobrapido.
Case 1: Bending Spoons, the billion data points company
Who are Bending Spoons, and how do they use data?
Bending Spoons is an Italian software house that specializes in world-class mobile applications.
“Every day, we collect information from millions of active users of our apps. The goal is to scale up our data science to an industrial level in order to make the most of the data we gather and make strategic business decisions”, states Marco Meneghelli, Head of the Data Science and Analytics Team at Bending Spoons.
The figures for the Bending Spoons portfolio are impressive: as of June 2021, its apps had amassed more than 400 million downloads and well over 12 million users, about half of them in North America.
Estimates put the number of data points to be checked at around one billion. These numbers are growing rapidly, which demands best-in-class solutions.
In order to make data-driven choices using its own internal optimization tools, and to grow the business, Bending Spoons needed to analyze large volumes of data quickly. The company looked for a way to implement a powerful data storage and analysis system that didn’t require specialist technical support. BigQuery on Google Cloud Platform (GCP) was the perfect fit.
Analytics for internal data tools
BigQuery is a great choice at entry level, because it makes it easy to set up data storage and automated queries. As well as using GCP to host some of its apps on Compute Engine and App Engine, the Bending Spoons DevOps team started using Kubernetes Engine to run certain data tasks.
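To give an idea of that entry-level simplicity, here is a minimal sketch of setting up storage and a basic automated query with the BigQuery Python client. The project, dataset, and schema are illustrative assumptions, not Bending Spoons’ actual setup.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Create a dataset to hold analytics data (no-op if it already exists).
client.create_dataset("analytics", exists_ok=True)

# Create a table with a simple schema for incoming events.
schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("action", "STRING"),
    bigquery.SchemaField("ts", "TIMESTAMP"),
]
client.create_table(bigquery.Table("my-project.analytics.events", schema=schema),
                    exists_ok=True)

# A simple recurring query: daily event counts over the last week.
query = """
    SELECT DATE(ts) AS day, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY day
    ORDER BY day
"""
for row in client.query(query).result():
    print(row.day, row.events)
```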
Feeding models
Most of the computational heavy lifting is managed by Crystal, another internal tool connected to BigQuery. Crystal collects hundreds of millions of data points every day, from all Bending Spoons’ apps and numerous external sources, then feeds them to a number of models.
This analysis of the app market offers an insightful view of trends and competition, tracks the performance of the company’s own apps, and identifies opportunities for growth.
Improving system operation routines
The next steps are already on track. Bending Spoons plans to focus on improving systems operation routines, and is looking at using Cloud Pub/Sub to stream its event data, as well as adopting Vertex AI.
Case 2: Jobrapido – job seeking on >125M events per day
Who are Jobrapido, and how do they use data?
A key ingredient in Jobrapido’s success is the way it pairs job seekers with opportunities. The company has invested in and developed a search engine that relies on the type of job rather than specific job titles, focusing its logic on a taxonomy of jobs instead of exact text matches.
Scaling pain
Jobrapido developed a unique matching logic with a focus on data-driven operations. The company relied on its own server infrastructure, but certain types of analysis were limited by space, cost, and time. Moreover, its engineers soon found they were spending more time maintaining infrastructure than developing new insights.
“Scaling had become a pain. We had availability issues and were running out of storage space”, says Stefano Fornari, VP of Engineering at Jobrapido. “With Google Cloud Platform, we can let Google handle those problems while our engineers can go back to focusing on the business logic instead of fussing over the details of infrastructure.”
From on-premises to hybrid to full cloud
The company moved its back office to the cloud in two steps, saving money and time while improving focus and efficiency.
Jobrapido took its first steps with GCP in September 2017, building a job advert classifier with App Engine and Cloud Pub/Sub. The classifier took in data from the millions of job adverts processed every day, assigned each one a label based on Jobrapido’s taxonomy model, and fed the results back into the main database.
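Jobrapido’s actual taxonomy model and classification logic are proprietary, but the general shape of a Pub/Sub-driven classifier can be sketched as follows. The subscription name and the keyword-based labelling are purely illustrative assumptions.

```python
import json

from google.cloud import pubsub_v1

# Purely illustrative taxonomy: map keywords in a job title to a category label.
TAXONOMY = {"nurse": "healthcare", "developer": "software", "driver": "logistics"}

def classify(advert: dict) -> str:
    """Assign a taxonomy label based on words found in the advert title."""
    title = advert.get("title", "").lower()
    for keyword, label in TAXONOMY.items():
        if keyword in title:
            return label
    return "other"

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    advert = json.loads(message.data.decode("utf-8"))
    label = classify(advert)
    # A real service would feed the label back into the main database here.
    print(advert.get("id"), label)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "job-adverts")
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()  # block and classify adverts as they arrive
```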
The company’s second step was to begin working on moving its data infrastructure to the Cloud. BigQuery was at the core of the new platform, providing a serverless, easy-to-use way of storing over one hundred tables, the largest of which contains over four billion rows.
Along with Vertex AI, App Engine, and Pub/Sub, the company used BigQuery to form a data pipeline that ingests raw data, processes it, and then aggregates and analyzes the results for further business insight.
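The aggregation step of such a pipeline can be pictured as a query that rolls raw events up into a reporting table. The sketch below uses the BigQuery Python client with hypothetical dataset and table names, not Jobrapido’s real schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Write the aggregated result into a dedicated reporting table (hypothetical names).
job_config = bigquery.QueryJobConfig(
    destination="my-project.reporting.daily_searches",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

query = """
    SELECT DATE(event_ts) AS day, country, COUNT(*) AS searches
    FROM `my-project.raw.search_events`
    GROUP BY day, country
"""

client.query(query, job_config=job_config).result()  # wait for the aggregation job
```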
BI analysis time cut by a factor of almost fifty
The new approach is incredibly efficient and saves a lot of money. “I’d say it’s reduced our overall costs by around half when you take into account the servers, the licenses and the time saved”, says Stefano Fornari, VP of Engineering at Jobrapido.
Looking specifically at business intelligence, BI analysis time was cut from 24 hours to 30 minutes – a reduction by a factor of 48, or about 98%.
Conclusions
Google Cloud Platform is the answer to modern software development and legacy migration. Using an advanced platform makes your company attractive to the best talent, and hiring that talent will keep you ahead of the competition for years to come. Streaming analytics performance is a key differentiator in today’s business environment.
If you are an experienced software professional, it is important to obtain official certifications. Becoming a Google Cloud Certified Professional is the best way to demonstrate the theoretical and practical knowledge necessary to design, develop, manage, and administer application infrastructure and data solutions on Google Cloud technology.
Following an initial assessment, instructors will provide you with the best possible preparation to help you pass your certification exam.
Focus on becoming an ACE (Associate Cloud Engineer), Professional Cloud Architect (PCA), or Professional Data Engineer (PDE). What’s your path? Sign up for Google Cloud Pro!