What is Data Science?
To avoid any misunderstandings about the definition of data science, please note that we will reference the definition provided by MIT(1) whenever this term is mentioned:
“Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large datasets. It is closely related to the fields of data mining and machine learning, but broader in scope.”
Since data science overlaps with many different disciplines, the use cases below provide context for how this broad scope can be narrowed to a specific end. The success stories offer clear examples of feasibility in real-world production.
Each case will be presented in a similar format that includes a real-world challenge, a data science-related solution for this challenge, and finally, how data science helped solve it.
Case 1: The Science of American Football
Our first use case examines the apparent connection between data science and sports success, drawing on Amazon Web Services (AWS), a deep well of documented solutions. The report is based on information provided by Elena Ehrlich, a data scientist at AWS.
The Challenge
The National Football League (NFL) has long used metrics to evaluate players, starting with the scouting combine held before every draft. Ehrlich’s system is a natural evolution of simply ranking the fastest players by their dash times and judging quarterbacks by how well they can hit non-human targets.
The Science
According to Ehrlich, the Spliced Binned-Pareto distribution (SBPD) method is used to “robustly and accurately model time-series with heavy tailed noise”. In probability, the term ‘heavy tailed’ refers to distributions prone to extreme values, or high levels of randomness. The same types of distributions appear in scenarios as wide-ranging as weather patterns and anthropological studies covering countrywide populations.
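To make ‘heavy tailed’ concrete, the following sketch compares Gaussian noise with Pareto noise. This is only an illustration of the statistical property SBPD is built to handle, not the SBPD model itself; the `tail_ratio` helper and all parameter choices are our own.

```python
import random
import statistics

def tail_ratio(samples):
    """Ratio of the largest observation to the median: a rough
    indicator of how heavy a distribution's tail is."""
    return max(samples) / statistics.median(samples)

random.seed(42)  # reproducible draws

# Light-tailed noise: absolute values of Gaussian draws rarely
# stray far from the typical value.
gaussian = [abs(random.gauss(0, 1)) for _ in range(10_000)]

# Heavy-tailed noise: Pareto draws (shape alpha = 1.5) produce the
# occasional extreme outlier, dwarfing the median observation.
pareto = [random.paretovariate(1.5) for _ in range(10_000)]

print(f"Gaussian tail ratio: {tail_ratio(gaussian):.1f}")
print(f"Pareto tail ratio:   {tail_ratio(pareto):.1f}")
```

The Pareto ratio comes out orders of magnitude larger: a single game-breaking play, freak weather event, or viral spike can dominate the whole series, which is exactly why ordinary Gaussian assumptions fail on this kind of data.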
The Solution
In football terms, their method of data analysis accounts for many more scenarios than previous approaches, including the various circumstances outside the game itself that affect player performance. The results are presented empirically, and experimentation can continue abundantly for as long as NFL games are played.
SBPD informed the NFL’s updated passer rating system, which supplanted the archaic QB rating used by the league’s official trackers. The new model still measures a player’s performance, but it accounts for more variation across different time periods, as well as the factors that contribute to these changes.
Of course, the implications go beyond sports, since the same ideas can be applied to any highly unpredictable series of events. This is particularly useful for predictive models in volatile markets, such as product sales driven by social media trends.
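The idea of a performance estimate that adapts across time periods can be sketched with a much simpler tool than SBPD: an exponentially weighted moving average. This is not the NFL’s or AWS’s actual model, and the weekly ratings below are invented for illustration.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: recent observations
    count more, so the estimate tracks shifts in performance over
    time instead of treating a whole season as one fixed number."""
    estimate = values[0]
    history = [estimate]
    for v in values[1:]:
        estimate = alpha * v + (1 - alpha) * estimate
        history.append(estimate)
    return history

# A quarterback's weekly rating: steady early season, then a slump.
ratings = [95, 98, 97, 96, 70, 72, 68, 71]
smoothed = ewma(ratings)
print([round(s, 1) for s in smoothed])
```

The smoothed series drifts downward after the slump begins, capturing the “variation across different time periods” that a single season-long average would hide.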
Recommended video: Why Most Data Science Projects Never Make it To Production
Case 2: Uber’s Revolutionary Ride Algorithms
As the premier transportation company in the world, Uber epitomizes the idea of a data science success story. This case study provides more insight into the specific methods that were used to manage intellectual property with virtually unlimited growth potential.
The Challenge
Uber faced the type of logistical challenges one might expect with the pool of consumer data alone. The necessity of multiple disciplines relating to big data becomes apparent from a single ride, which must estimate the driver’s ETA based on the user’s location while factoring in traffic and providing a fare. This requires geolocation data, personal and financial consumer data, and real-time traffic data working for millions of transactions per month.
The Science
Uber is secretive about the specifics of its operation, but much of its approach to data science can be inferred from this excerpt of job qualifications for Uber’s Senior Data Scientist position:
- Selecting and employing advanced statistical procedures to obtain actionable insights
- Cross-validating models to ensure their generalizability
- Designing and analyzing large-scale online experiments and interpreting the results to draw actionable conclusions
From these three we can glean that Uber employs ‘advanced statistical procedures’, which are really the company’s proprietary algorithms. Some of these may have been reverse-engineered, but the specific weighted values applied by Uber are probably ever-changing. We can also see the apparent use of cross-validation and large-scale online experimentation in Uber’s toolset.
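The ‘cross-validating models’ qualification refers to a standard technique, so a generic sketch can show the idea. This is not Uber’s pipeline: the fold-splitting helper, the trivial predict-the-mean model, and the toy fare data are all our own stand-ins.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k=5):
    """Estimate the generalization error of a trivial 'predict the
    training mean' model with k-fold cross-validation: hold each fold
    out in turn, fit on the rest, and average the test-fold MSE."""
    folds = k_fold_indices(len(xs), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        prediction = statistics.mean(ys[j] for j in train_idx)
        mse = statistics.mean((ys[j] - prediction) ** 2 for j in test_idx)
        errors.append(mse)
    return statistics.mean(errors)

# Toy data: fares loosely proportional to trip distance, plus noise.
random.seed(1)
distance = [random.uniform(1, 20) for _ in range(100)]
fare = [2.5 + 1.8 * d + random.gauss(0, 2) for d in distance]
print(f"Cross-validated MSE: {cross_validate(distance, fare):.2f}")
```

Because every observation serves as test data exactly once, the averaged error is a far more honest estimate of how the model will behave on unseen rides than a single train/test split.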
The Solution
It can be surmised that Uber employs region-specific sets of algorithms that weigh different factors accordingly. Much of the geolocation heavy lifting is handled by the map application, which correlates GPS and vehicle telemetry to pinpoint the nearest drivers and the best routes. Uber layers on machine learning, artificial intelligence, and route optimization algorithms that draw upon real-time data continuously.
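A minimal sketch of the “pinpoint the nearest drivers” step, under heavy simplifying assumptions: straight-line great-circle distance stands in for a road-network ETA, and the driver records and coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_driver(rider, drivers):
    """Pick the driver with the smallest great-circle distance to the
    rider; real dispatch would rank by road-network travel time."""
    return min(drivers, key=lambda d: haversine_km(*rider, d["lat"], d["lon"]))

rider = (37.7749, -122.4194)  # hypothetical rider location
drivers = [
    {"id": "d1", "lat": 37.80, "lon": -122.41},
    {"id": "d2", "lat": 37.77, "lon": -122.42},
    {"id": "d3", "lat": 37.70, "lon": -122.45},
]
print(nearest_driver(rider, drivers)["id"])  # -> d2, the closest driver
```

The production version replaces the distance function with traffic-aware routing and re-runs continuously as cars move, which is where the real-time data streams come in.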
Case 3: Open-Source Machine Learning
Our third use case examines a success story that isn’t directly tied to business or victory, but to the type of innovation that affects everyone. TensorFlow, originally developed by the Google Brain research team, is an open-source tool built by data scientists for data scientists to manage machine learning operations. As such, it is a complex application meant for deep learning projects, intended for software engineers and others with similar experience.
The Challenge
The challenge of TensorFlow was to adapt a concept as varied as machine learning into a cohesive product that withstands the rigors of any task. Since machine learning can apply to virtually any field that employs large data sets, creating an adaptable solution means considering all reasonable avenues of inquiry.
The Science
A ‘tensor’ is a mathematical object that describes multilinear relationships between sets of algebraic objects, which makes tensors a straightforward way to describe physical quantities across multiple dimensions. Connecting complex mathematical objects in such a straightforward yet versatile way enables the kind of high-level analysis that node mapping and neural networks require.
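In code, a tensor is simply a multi-dimensional array, and ‘multilinear’ means the operations on it are linear in each argument. The toy example below illustrates the concept in plain Python rather than TensorFlow’s own API; the values are arbitrary.

```python
def matvec(matrix, vector):
    """Apply a rank-2 tensor (matrix) to a rank-1 tensor (vector):
    a multilinear map, linear in each argument separately."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Rank is just the depth of nesting: a rank-3 tensor could be a tiny
# 2 x 2 image with 2 color channels per pixel.
rank3 = [[[1, 2], [3, 4]],
         [[5, 6], [7, 8]]]

scale = [[2, 0], [0, 2]]      # rank-2 tensor that doubles every component
print(matvec(scale, [3, 4]))  # -> [6, 8]
```

TensorFlow generalizes exactly this: data of any rank flows through a graph of such operations, which is where the library’s name comes from.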
As its moniker suggests, TensorFlow is most useful in graphing the relationships between these complex entities in a way that provides far more insight. This allows software developers, hardware developers, and even social media marketers to observe more specific patterns in large data sets with many varying attributes, and make corrections accordingly.
The Solution
Nvidia, the world’s most well-known video card developer, provided an ideal use case for the value of TensorFlow in production in an article describing the product. Nvidia outlines the benefits of using TensorFlow, and by extension, the benefits of deep learning algorithms for worldwide hardware manufacturers.
Nvidia uses TensorFlow to model various processes, most likely spending the bulk of its time creating computational simulations that represent actual hardware. And lest the connection seem limited to GPU makers, note that companies like Twitter, Airbus, and PayPal also employ TensorFlow as a foundational tool.
More on Data Science
Thanks to tech giants like Google and Oracle, and organizations like the Linux Foundation, actively supporting open-source development, data science fields like deep learning and artificial intelligence have become available to everyone. And while some companies might think the concepts involved are beyond their means, they are likely already applying data science in some capacity each time they use an app for work.
If you’d like to learn more about how data science can be applied to your specific modeling needs, more resources are available here.
References: