Learning statistics is essential for pursuing a career in data science or analytics. Data scientists and analysts use statistics to uncover the meaning behind data. A spreadsheet with millions of customer characteristics is just a bunch of numbers and can be overwhelming – but when you translate the data into key findings, the information can unveil trends and inform decisions.
“Statistics is the art and science of learning with data,” says Michael Posner, associate professor of statistics and director of the Center for Statistics Education at Villanova University. “It is about using data to inform decision-making or to gain knowledge.”
The good news is that you don’t need to enroll in a university to learn basic statistics. Many free online tools teach statistics concepts so you can prepare for a career in data science or analytics. This guide will help you get started.
Statistics is essential in data science and analytics professions. “Someone without strong statistical thinking skills will conduct analyses without full consideration of what is most appropriate in a given situation, often getting the right answer to the wrong question,” Posner says.
It helps data scientists and analysts tell the story behind the data. “Statistics can take the collected, cleaned, sorted and summarized data that analytics gives us and help us push it a bit further,” says Phong Le, associate professor of mathematics at Goucher College in Maryland who teaches classes in Goucher’s integrative data analytics major.
In her role as a data scientist at the research firm Valkyrie in Austin, Texas, Keatra Nesbitt relies on statistics to help clients understand data so they can make important business decisions.
“Because of statistics, I’ve been able to analyze financial data at a university, improve a high school’s state-mandated math test scores from a 54% pass rate to over 90%, rebuke a company’s misconceptions about its employees and identify a successful brand strategy for a large corporation to outperform other brands,” she says. “No matter the type of problem you are presented with, being a statistician gives you the critical thinking skills necessary to approach the issue.”
Statistics and Data Science
“Data science is the combination of statistics and computer science,” Nesbitt says, adding that statistics is a core component to pursuing a career in data science.
By using statistics, data scientists can gather raw data and draw conclusions about what those numbers mean. Statistics also helps them filter data, separating meaningful information from superfluous values.
“When analyzing features in the dataset, I can test if the sample differences are statistically significant,” Nesbitt says. “This may change the design or type of input features used in the model.”
What’s the difference between statistics and data science? Le says that in practice, data science is “the gas pedal, finding patterns and creating dramatic summaries and visualizations,” while statistics is the brake pedal, “reminding us that not everything data-driven is generalizable and what worked before may not work in the future.”
Statistics and Machine Learning
“The field of machine learning has borrowed several concepts from statistics and built new algorithms and tools on top of them while also incorporating theory from other mathematical fields, such as linear algebra, calculus and discrete mathematics,” says Vangelis Metsis, assistant professor in Texas State University’s computer science department.
While statistics is the process of understanding relationships between dependent and independent variables, Metsis says machine learning is about applying the data to make accurate predictions, even if that relationship is not fully understood.
Statistics helps experts understand why machine learning models behave the way they do, Metsis adds. It allows users to interpret the increasingly complex models used in machine learning.
Statistics and Its Use with Data and Analytics
Statistics is widely used in business. Business analysts use statistics to analyze data so managers can make decisions. For example, analysts might study data related to business performance and use it to predict possible outcomes, allowing a company to plan for the future.
Business analysts aren’t the only ones who should understand data. Even if you are not responsible for overseeing spreadsheets, coding or collecting data, “you need to know precisely how good data can enhance your decision-making and build your perspective,” Le says.
To learn statistics for a data science or analytics career, start with the basics. Statisticians use the following core concepts to analyze a dataset:
Mean is another word for the average of a dataset. Statisticians use different types of means. The arithmetic mean is the “average” that you probably learned in math. To get it, you add a set of values (1 + 2 + 3 = 6) and divide the sum by the number of values (3), which gives 2. Beyond this, there are other types of means: weighted mean, geometric mean, harmonic mean and Heronian mean.
The mode of a dataset is the most common value. For example, if you have a dataset of 5, 5, 6, 7, 8, the mode would be 5 because there are two 5s in the dataset.
The median is the middle value of a dataset when written in ascending order. In the dataset 5, 5, 6, 7, 8, the median is 6 because there are two numbers below it and two numbers above it.
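As a minimal sketch, the three measures above can be checked with Python's built-in `statistics` module. The dataset is the one used in the text; the variable names are only illustrative.

```python
import statistics

values = [5, 5, 6, 7, 8]  # the example dataset from the text

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle value once the data is sorted
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)  # prints: 6.2 6 5
```

Note that the mean (6.2) and the median (6) differ slightly here because the values are not spread evenly around the center.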
Correlation describes the relationship between variables, Posner says. “For example, is there a relationship between smoking and lung cancer?” Correlation is measured on a scale of -1 to 1. A value of -1 means the variables move in exactly opposite directions, and 1 means they move in exactly the same direction. A correlation of 0 indicates there is no linear link between the variables.
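To make the -1 to 1 scale concrete, here is a small sketch that computes the Pearson correlation coefficient, one common way of measuring correlation, by hand. The function name and the sample numbers are hypothetical and chosen only for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

hours = [1, 2, 3, 4, 5]                # hypothetical values
same_direction = [2, 4, 6, 8, 10]      # rises in lockstep with hours
opposite_direction = [10, 8, 6, 4, 2]  # falls in lockstep with hours

print(pearson_correlation(hours, same_direction))
print(pearson_correlation(hours, opposite_direction))
```

The first result should come out at, or within rounding error of, 1 and the second at -1. Real-world data usually lands somewhere in between.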
Standard deviation measures the spread of a dataset around its average; it quantifies the dispersion of values around the mean. It is commonly illustrated with a bell curve graph, in which the mean is the high point in the center of the curve.
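The sketch below shows one way to compute standard deviation in Python, reusing the small dataset from the mode and median examples. It assumes the standard library's `statistics` module, which offers two versions: `pstdev` treats the data as the whole population, while `stdev` treats it as a sample drawn from a larger one.

```python
import statistics

values = [5, 5, 6, 7, 8]

population_sd = statistics.pstdev(values)  # divides squared deviations by n
sample_sd = statistics.stdev(values)       # divides squared deviations by n - 1

print(round(population_sd, 3), round(sample_sd, 3))
```

The sample version is always a little larger, because dividing by n - 1 compensates for estimating the mean from the same data.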
Uncertainty in statistics is measured by the degree of error in an estimate. This is often reported as a margin of error or bias.
Margin of Error
The margin of error measures how far sample results are likely to be from the real population value. It is usually reported as a percentage alongside a confidence level. For instance, a 90% confidence level with a 5% margin of error indicates that, across repeated samples, your result will be within 5 percentage points of the population value about 90% of the time.
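For a sense of where such a number comes from, here is a rough sketch of the textbook margin-of-error calculation for a survey proportion. It assumes a simple random sample and the normal approximation, neither of which the article specifies, and the function name and survey figures are hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(sample_proportion, sample_size, confidence=0.90):
    """Approximate margin of error for a proportion (normal approximation)."""
    # z-score that leaves (1 - confidence) split evenly between the two tails.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    standard_error = math.sqrt(sample_proportion * (1 - sample_proportion) / sample_size)
    return z * standard_error

# Hypothetical survey: 50% of 1,000 respondents answered "yes".
small_survey = margin_of_error(0.5, 1_000)
large_survey = margin_of_error(0.5, 10_000)
print(f"{small_survey:.1%} vs. {large_survey:.1%}")
```

Under these assumptions, the larger survey produces the smaller margin of error, which is why sample size matters so much when interpreting results.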
Bias measures how likely an estimate is to over- or underrepresent the actual value. “Is there anything about the process used to collect or process the data that makes your estimate not accurate?” Posner asks. “For example, if you asked people their weight, those that choose not to answer your question might be heavier than those who choose to answer, so you have underestimated the true value of average weight in the population.”
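Posner's weight example can be imitated with a small simulation. Everything in this sketch is made up for illustration: the population, the 170-pound cutoff and the response rates are assumptions, not real survey behavior.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Hypothetical population of 10,000 adult weights, in pounds.
population = [random.gauss(170, 30) for _ in range(10_000)]

def chooses_to_answer(weight):
    # Assumed behavior: people at or above 170 pounds answer less often.
    probability = 0.9 if weight < 170 else 0.4
    return random.random() < probability

responses = [weight for weight in population if chooses_to_answer(weight)]

true_mean = statistics.mean(population)
survey_mean = statistics.mean(responses)
print(round(true_mean, 1), round(survey_mean, 1))
```

Because heavier people are underrepresented among those who answer, the survey average falls below the true population average, no matter how many people are asked.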
Descriptive statistics helps you analyze and present data in a way that can be easily interpreted. It describes the characteristics of a given dataset using the core concepts outlined above.
“Descriptive statistics reveal a lot about the data, but are simple to calculate and don’t require much skill or computing power,” Posner says.
Instead of presenting a long list of numbers, descriptive statistics allows analysts to determine the mean, median and standard deviation, so they can better understand how data is distributed. Because of this, descriptive statistics allows data scientists and other analysts to better interpret the numbers.
Descriptive statistics also helps with data visualization. “Not only do we calculate summary measures … but we look at graphical displays that give you the entire distribution of data,” Posner says. “This not only shows you the shape and location of the data, but also whether there are outliers that are different from the rest of the data or other interesting characteristics of the data.”
Descriptive statistics uses measures of central tendency, such as mean and median, to describe the center of the dataset. It uses measures of variability, such as standard deviation, minimum and maximum, to describe the spread of the data.
What descriptive statistics does not do is allow you to generalize beyond the data sample to the population it came from, Metsis says. “For example, a basketball team may want to use descriptive statistics to understand the performance of their players and make improvements to their training practices but (does not) attempt to extrapolate those findings to the whole league.”
Since machine learning uses data to make predictions rather than to understand a given dataset, this and similar fields like data science are more closely related to inferential statistics, Metsis says.
While descriptive statistics is used to explain the characteristics of a dataset, inferential statistics allows you to make predictions based on that data.
“The purpose of the inferential statistic is to understand the properties of the whole population by studying the behavior of a set of variables on a smaller sample,” Metsis says. “To go back to the sports analogy, a basketball league may study a few players’ performance statistics to understand how traveling affects the game performance of basketball players as a whole.”
Inferential statistics involves estimation and hypothesis testing. In estimation, you use the sample dataset to make a statement about the broader population. This extrapolation requires incorporating uncertainty into the analysis. To address this, statisticians apply a margin of error to their estimates.
“For example, a poll that says 45% of people will vote for Trump with a margin of error of 1% means that we are confident that between 44% and 46% will vote for him,” Posner says. “A poll that says 45% of people will vote for Trump with a margin of error of 20% means that we are confident that between 25% and 65% of people will vote for him.”
Given these margins of error, you can see that the first poll is more meaningful.
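The arithmetic behind the two polls can be sketched in a few lines. The interval calculation follows the quote directly. The sample-size estimate is an added illustration that assumes a simple random sample, the normal approximation and a 95% confidence level, none of which are stated in the example.

```python
import math
from statistics import NormalDist

def poll_interval(estimate_pct, margin_pct):
    """Return the (low, high) range implied by an estimate and its margin of error."""
    return estimate_pct - margin_pct, estimate_pct + margin_pct

def required_sample_size(estimate_pct, margin_pct, confidence=0.95):
    """Rough number of respondents needed to reach a given margin of error."""
    p = estimate_pct / 100
    margin = margin_pct / 100
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(poll_interval(45, 1))   # prints: (44, 46)
print(poll_interval(45, 20))  # prints: (25, 65)

precise_poll_size = required_sample_size(45, 1)
vague_poll_size = required_sample_size(45, 20)
print(precise_poll_size, vague_poll_size)
```

Under these assumptions, the 1% poll needs thousands of respondents while the 20% poll needs only a couple dozen, which is another way of seeing why the first poll carries more information.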
In hypothesis testing, statisticians try to use a dataset to answer research questions, such as who will win the next presidential election or if traveling hinders the performance of basketball players.
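One simple way to run a hypothesis test is a permutation test, sketched below for the basketball question. This is just one of many testing methods, and the points-per-game figures are invented for illustration. The idea is to ask how often a gap as large as the observed one would appear if the home and away labels were shuffled at random.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference between two group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels are meaningless
        shuffled_gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_gap >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical points per game for one player.
home_points = [24, 27, 22, 30, 26, 25, 28, 29]
away_points = [21, 19, 25, 20, 23, 18, 22, 24]

p_value = permutation_test(home_points, away_points)
print(p_value)
```

A small p-value suggests the gap between home and away games is unlikely to be a fluke of chance alone. It does not prove that travel is the cause.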
“Inference and the ability to generalize is a core design principle of many machine learning algorithms,” Metsis says. “In fact, the whole idea of machine learning is predicated on learning from a limited set of training examples and subsequently applying the gained knowledge outside of the dataset used for training.”
Data science and machine learning use predictive modeling, also called predictive analytics, to make future predictions based on past information. Datasets are analyzed for patterns and trends that can be used to create a model of potential future outcomes. Then, those outcomes are assigned a probability for how likely they are to occur.
Predictive modeling can be used to forecast behavior or determine the risk of a negative outcome occurring in a variety of fields. For example, marketing analysts use predictive modeling to determine how a business is performing by looking at metrics like return on investment.
Predictive modeling applies a variety of analytic tools – in particular, regression, which fits a dataset to a predictive model. Linear regression is the simplest and most widely used form of regression analysis. A linear equation is a model for the relationship between two variables. One variable is considered to be independent, referred to as the explanatory variable. The other is the dependent variable, and its value depends on the first.
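As an illustration, the following sketch fits a straight line to a handful of points using ordinary least squares, the standard fitting method for simple linear regression. The variable names and numbers are hypothetical, and the data was chosen to lie exactly on the line y = 2x + 1 so the result is easy to verify.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]  # explanatory (independent) variable, hypothetical
sales = [3, 5, 7, 9, 11]    # dependent variable, exactly 2 * ad_spend + 1

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6 + intercept  # prediction for an unseen value of 6

print(slope, intercept, predicted_sales)  # prints: 2.0 1.0 13.0
```

With real data the points will not sit perfectly on a line, and the fitted line is the one that keeps the squared prediction errors as small as possible.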
Logistic regression is similar to linear regression, except instead of using two measurement variables, it uses one measurement variable and one nominal, or categorical, variable, whose values are labels rather than numbers. Examples of nominal variables are gender and occupation. When the dependent nominal variable has two potential values, it is considered a binary logistic regression. When it has more than two potential values, it is a multinomial logistic regression. If the categories of the dependent variable have a natural order, it is called an ordinal logistic regression.
In logistic regression, the measurement variable is the independent variable. For instance, you might want to model whether it will rain (nominal variable) based on the temperature outside. In this case, you would write the logistic regression model as the probability that it will rain, given the temperature. Fields like machine learning use logistic regression when dealing with binary classification models where you’re trying to model a scenario with two potential outcomes.
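The rain example might look something like the sketch below, which fits a binary logistic regression with plain gradient descent. The temperature and rain observations are invented, and the learning rate, step count and rescaling of temperature are arbitrary choices made for illustration. In practice, analysts typically rely on an established statistics or machine learning library.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, labels, learning_rate=0.1, n_steps=5_000):
    """Fit P(label = 1 | x) = sigmoid(intercept + slope * x) by gradient descent."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_intercept = 0.0
        grad_slope = 0.0
        for x, label in zip(xs, labels):
            error = sigmoid(intercept + slope * x) - label
            grad_intercept += error
            grad_slope += error * x
        intercept -= learning_rate * grad_intercept / n
        slope -= learning_rate * grad_slope / n
    return intercept, slope

# Hypothetical observations: temperature in Fahrenheit and whether it rained (1) or not (0).
temperatures = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
rained = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

# Rescale temperature so gradient descent behaves well.
mean_temperature = sum(temperatures) / len(temperatures)
scaled = [(t - mean_temperature) / 10 for t in temperatures]
intercept, slope = fit_logistic(scaled, rained)

def rain_probability(temperature):
    """Modeled probability of rain, given the temperature."""
    return sigmoid(intercept + slope * (temperature - mean_temperature) / 10)

print(round(rain_probability(55), 2), round(rain_probability(90), 2))
```

In this made-up dataset rain is more common on cooler days, so the model assigns a higher probability of rain at 55 degrees than at 90 degrees.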
Python is a general-purpose, high-level programming language. General-purpose means it is used in a variety of applications, as opposed to special-purpose programming languages, which are designed to solve a specific set of problems. Being high-level means Python is designed to be simpler and easier to read than the actual code run by a computer.
Python has gained traction in machine learning fields and its subfields, thanks in part to its intuitive, easy-to-learn nature, Metsis says.
As a high-level language, Python also has productivity advantages compared with other programming languages, like C. “With a few lines of code, you can do things that in other languages would require many more lines of code to complete,” Posner says.
Posner says Python’s extensive collection of free libraries is the main reason it has become a go-to language for building machine learning applications.
R is another programming language used by statisticians. It provides facilities for data storage and manipulation along with a variety of statistical techniques, such as time-series analysis and linear and nonlinear modeling. R also lets users create graphical representations of their data, both on-screen and in hard copy, and define new functions beyond the pre-built ones.
“For data analysis, most statisticians use R (some use SAS or Python), and most computer scientists use Python,” Posner says. “If you want a profession in data science or analytics, it’s generally recommended to know both of them and have expertise in at least one.”
“Statistics is an in-depth study, not an overnight study, so there will always be more to learn,” Nesbitt says.
Aspiring learners should start with the basics, such as measures of central tendency, probability and normal distributions, Nesbitt says. Then, apply statistical principles to real-world problems. “Sometimes, it’s easier to learn when you can address a concrete problem versus a hypothetical one,” she says. “You’ll build your knowledge base as you are introduced to new scenarios and examples.”
You can find hands-on learning projects in your own backyard. Le points to Baltimore’s 311 Customer Service Requests dataset, freely available thanks to the city’s open data initiative. “In those 7 million rows, there are hundreds of stories,” he says.
Le has a friend who made a heatmap of all the trash complaints by street corner in the neighborhood to give to the city. “Those spots were targeted during neighborhood cleanups,” he says.
He recommends those looking to learn statistics seek similar civic open data initiatives. “Like the cities themselves, each of these data repositories have their own feel,” he says. “They might have their own basic analysis tools to help get you going.”
Once you know what’s available, the next step is figuring out what big questions data can help answer.
There are a number of online resources to help you learn statistics. Massachusetts Institute of Technology is offering a course called Fundamentals of Statistics for free through edX, an online learning provider. Class begins May 10, 2021, and lasts 18 weeks. For $300 you can get a verified certificate of completion. Other courses are also available through MIT OpenCourseWare.
Books can also be helpful study guides. Le likes “How to Lie with Statistics” by Darrell Huff because of how it explains the ways “statistics is used, abused and misunderstood.” Other books he recommends include “The Lady Tasting Tea” by David Salsburg, “Moneyball” by Michael Lewis and “The Signal and the Noise” by Nate Silver.
There are many paths you can take to learn statistics, from pursuing an undergraduate or master’s degree to creating your own “degree” program with free online classes. However you decide to pursue your learning, to be successful in studying statistics, you need to be disciplined in your approach.
Start by creating a study schedule. If you’re taking statistics classes, plan on spending at least two hours studying for every hour of class. Consider joining study groups or seeking out online communities of people supporting each other in their learning processes. You may even be able to find a mentor who can help you along the way.
The most important element to succeeding in your study of statistics is to stick with it. Remember your reason for learning statistics. When you understand the math behind statistics, you’ll open the door to new career opportunities in data science, analytics and many other fields.
“Mathematics is interwoven into our world, from marketing to finance and everything in between, and when you start to make those connections, you’ll naturally become a better statistician,” Nesbitt says.