In the world of process improvement and data analysis, understanding the relationships between variables is crucial for making informed decisions. The Analyse phase of the DMAIC (Define, Measure, Analyse, Improve, Control) methodology in Lean Six Sigma provides practitioners with powerful statistical tools to uncover hidden patterns and correlations within their data. Among these tools, scatter plots stand out as one of the most intuitive and visually compelling methods for examining how two variables relate to each other.
Whether you are a quality manager seeking to reduce defects, a process engineer optimizing production efficiency, or a business analyst looking to improve customer satisfaction, scatter plots offer a straightforward approach to visualizing complex data relationships. This comprehensive guide will walk you through the fundamentals of creating and interpreting scatter plots during the Analyse phase, complete with practical examples and real-world applications.
Understanding the Analyse Phase in Lean Six Sigma
The Analyse phase represents the third critical stage in the DMAIC framework, following Define and Measure. After defining the problem and collecting relevant data, the Analyse phase focuses on identifying the root causes of defects, variations, or inefficiencies in your process. This phase transforms raw data into actionable insights through various statistical and graphical analysis techniques.
During the Analyse phase, practitioners employ multiple tools including hypothesis testing, regression analysis, process capability studies, and visual data analysis methods. Scatter plots fall into this last category, serving as a fundamental graphical tool that helps teams visualize potential relationships between input variables (X) and output variables (Y), or between two input variables that might influence each other.
What Are Scatter Plots and Why Are They Important?
A scatter plot, also known as a scatter diagram or scatter graph, is a type of mathematical diagram using Cartesian coordinates to display values for two variables from a dataset. Each point on the plot represents an observation, with its position determined by the values of two variables: one plotted along the horizontal axis (x-axis) and another along the vertical axis (y-axis).
The primary purpose of scatter plots in Lean Six Sigma is to investigate whether a relationship exists between two variables and, if so, what the nature and strength of that relationship might be. This visualization technique allows practitioners to:
- Identify positive, negative, or no correlation between variables
- Detect outliers that may require further investigation
- Recognize patterns, trends, or clusters within the data
- Determine the strength of relationships between process inputs and outputs
- Make predictions based on observed relationships
- Communicate findings effectively to stakeholders who may not have technical backgrounds
Types of Relationships Revealed by Scatter Plots
When you create a scatter plot, the pattern formed by the plotted points can reveal several types of relationships between your variables:
Positive Correlation
When both variables tend to increase together, you observe a positive correlation. The points on the scatter plot will form a pattern that rises from left to right. For example, in a manufacturing setting, you might find that as machine temperature increases, production speed also increases, up to a certain point.
Negative Correlation
When one variable increases while the other decreases, you have a negative correlation. The scatter plot pattern descends from left to right. A common example would be the relationship between defect rates and employee training hours, where more training is typically associated with fewer defects.
No Correlation
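Sometimes, two variables show no discernible pattern or relationship. In these cases, the points appear randomly scattered across the plot without any clear trend. This finding is equally valuable: it suggests that, within the range studied, the X variable is unlikely to be a major driver of the Y variable, allowing you to focus your improvement efforts elsewhere. Keep in mind that the absence of a visible pattern does not prove the variables are unrelated, since a narrow range of X values or the influence of another variable can mask a real effect.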
Non-Linear Relationships
Not all relationships follow a straight line. Variables might exhibit curved patterns, exponential relationships, or other complex interactions. Recognizing these patterns requires careful observation and may necessitate advanced statistical techniques beyond simple linear correlation analysis.
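As a small illustration of why a straight-line measure can miss a curved pattern, the sketch below computes Pearson's correlation coefficient for points that lie exactly on a parabola. The data and the helper name `pearson_r` are made up for illustration and do not come from any real process.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: Y is completely determined by X, but along a curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")  # at (or extremely close to) zero despite the exact relationship
```

A scatter plot of these points would show the U-shape immediately, which is why the plot should be inspected before relying on a single correlation number.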
Creating Scatter Plots: A Step-by-Step Guide
Creating an effective scatter plot involves several deliberate steps that ensure your analysis yields meaningful insights:
Step 1: Identify Your Variables
Begin by clearly identifying the two variables you want to examine. Typically, you will designate one as the independent variable (X), which you suspect might influence the outcome, and one as the dependent variable (Y), which is the outcome you are trying to understand or predict. In Lean Six Sigma terminology, X represents a process input or potential root cause, while Y represents the process output or effect.
Step 2: Collect Your Data
Gather paired data points for both variables. Each observation must have values for both the X and Y variables. The quality and quantity of your data will significantly impact the reliability of your analysis. Aim for at least 30 data points when possible, though meaningful patterns can sometimes emerge with fewer observations.
Step 3: Organize Your Data
Structure your data in a table format with two columns, one for each variable. Ensure data integrity by checking for missing values, obvious errors, or anomalies that might skew your analysis.
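One way to carry out this check is sketched below: it separates complete (X, Y) pairs from rows where either value is missing, so the incomplete rows can be reviewed rather than silently plotted or dropped. The function name `split_complete_pairs` and the sample rows are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def split_complete_pairs(rows):
    """Separate (x, y) rows into complete pairs and rows with a missing value."""
    complete, incomplete = [], []
    for x, y in rows:
        if is_missing(x) or is_missing(y):
            incomplete.append((x, y))
        else:
            complete.append((x, y))
    return complete, incomplete

# Hypothetical rows for illustration only
rows = [(8.5, 12.3), (9.2, None), (float("nan"), 11.2), (10.1, 15.2)]
complete, incomplete = split_complete_pairs(rows)
print(len(complete), "complete pairs;", len(incomplete), "set aside for review")
```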
Step 4: Construct the Plot
Create your scatter plot using statistical software, spreadsheet applications, or specialized quality management tools. Plot the independent variable (X) on the horizontal axis and the dependent variable (Y) on the vertical axis. Each pair of values becomes a single point on the graph.
Step 5: Label and Scale Your Axes
Clearly label both axes with the variable names and their units of measurement. Choose appropriate scales that display your data effectively without compressing or distorting the visual pattern. The scale should encompass all your data points while making efficient use of the plotting space.
Step 6: Analyze the Pattern
Study the resulting plot to identify patterns, trends, or relationships. Look for the overall direction of the data, the strength of any apparent relationship, and any outliers or unusual observations.
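The steps above could be sketched in Python roughly as follows, assuming the matplotlib library is available. The function name `save_scatter_plot` and its parameters are illustrative choices, not part of any standard Lean Six Sigma tool.

```python
def save_scatter_plot(xs, ys, x_label, y_label, path):
    """Draw a labeled scatter plot of paired data and save it to `path`."""
    if len(xs) != len(ys):
        raise ValueError("each observation needs both an X and a Y value")

    # Imported here so the length check above works even where
    # matplotlib is not installed; a top-level import is equally fine.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend: write to a file, no window
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)      # one point per (x, y) pair
    ax.set_xlabel(x_label)  # independent variable (X) on the horizontal axis
    ax.set_ylabel(y_label)  # dependent variable (Y) on the vertical axis
    ax.set_title(f"{y_label} vs. {x_label}")
    ax.margins(0.05)        # small padding so edge points are not clipped
    fig.savefig(path)
    plt.close(fig)
```

A call might look like `save_scatter_plot(x_values, y_values, "Input (units)", "Output (units)", "scatter.png")`, where `x_values` and `y_values` are your two data columns. Including the units in the axis labels covers Step 5 in the same call.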
Practical Example: Manufacturing Process Optimization
Let us explore a detailed example from a manufacturing environment to illustrate how scatter plots work in practice. Imagine a pharmaceutical company experiencing inconsistent tablet hardness in their production line. The quality team suspects that compression force during tablet formation might be related to the final hardness measurement.
Sample Dataset
The team collects data over two weeks, measuring both compression force (in kilonewtons) and resulting tablet hardness (in kiloponds) for 25 production batches:
Data Collection Table:
- Batch 1: Compression Force 8.5 kN, Tablet Hardness 12.3 kp
- Batch 2: Compression Force 9.2 kN, Tablet Hardness 13.8 kp
- Batch 3: Compression Force 7.8 kN, Tablet Hardness 11.2 kp
- Batch 4: Compression Force 10.1 kN, Tablet Hardness 15.2 kp
- Batch 5: Compression Force 8.9 kN, Tablet Hardness 13.1 kp
- Batch 6: Compression Force 9.8 kN, Tablet Hardness 14.6 kp
- Batch 7: Compression Force 8.2 kN, Tablet Hardness 11.9 kp
- Batch 8: Compression Force 10.5 kN, Tablet Hardness 15.8 kp
- Batch 9: Compression Force 9.5 kN, Tablet Hardness 14.1 kp
- Batch 10: Compression Force 8.0 kN, Tablet Hardness 11.5 kp
- Batch 11: Compression Force 11.2 kN, Tablet Hardness 16.4 kp
- Batch 12: Compression Force 9.0 kN, Tablet Hardness 13.3 kp
- Batch 13: Compression Force 8.7 kN, Tablet Hardness 12.8 kp
- Batch 14: Compression Force 10.8 kN, Tablet Hardness 16.0 kp
- Batch 15: Compression Force 9.4 kN, Tablet Hardness 13.9 kp
- Batch 16: Compression Force 7.5 kN, Tablet Hardness 10.8 kp
- Batch 17: Compression Force 10.3 kN, Tablet Hardness 15.4 kp
- Batch 18: Compression Force 8.8 kN, Tablet Hardness 13.0 kp
- Batch 19: Compression Force 9.9 kN, Tablet Hardness 14.8 kp
- Batch 20: Compression Force 8.4 kN, Tablet Hardness 12.1 kp
- Batch 21: Compression Force 11.0 kN, Tablet Hardness 16.2 kp
- Batch 22: Compression Force 9.3 kN, Tablet Hardness 13.7 kp
- Batch 23: Compression Force 8.1 kN, Tablet Hardness 11.7 kp
- Batch 24: Compression Force 10.6 kN, Tablet Hardness 15.6 kp
- Batch 25: Compression Force 9.1 kN, Tablet Hardness 13.5 kp
Interpreting the Results
When this data is plotted on a scatter diagram with compression force on the x-axis and tablet hardness on the y-axis, a clear positive correlation emerges. The points form a pattern that rises from left to right, indicating that as compression force increases, tablet hardness also tends to increase. This relationship appears to be fairly strong and linear within the observed range.
The scatter plot reveals several important insights for the quality team. First, the strong positive correlation supports their hypothesis that compression force is an important process parameter for tablet hardness, although correlation alone does not prove causation. Second, the relatively tight clustering of points around an imaginary trend line suggests that this relationship is consistent and predictable within the observed range. Third, there are no obvious outliers, which suggests that these 25 batches contain no extreme anomalies requiring special investigation.
Based on these findings, the team can move forward with reasonable confidence that controlling compression force is likely to be an effective strategy for maintaining consistent tablet hardness. They might establish control limits for compression force, standardize machine settings, or implement monitoring systems to ensure the force stays within an optimal range that produces tablets meeting hardness specifications.
Advanced Scatter Plot Techniques
Adding Trend Lines
Most statistical software packages allow you to add a trend line, also called a regression line or line of best fit, to your scatter plot. This line represents the mathematical relationship between the two variables and can be used to make predictions, ideally within the range of X values actually observed, since the relationship may not hold outside it. The equation of the trend line provides quantitative information about how much Y changes for each unit change in X.
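For reference, when the trend line is fitted by ordinary least squares (the usual default in statistical software, though packages may offer other methods), the line for n paired observations can be written as follows. The symbols b_0 (intercept), b_1 (slope), and the bar notation for sample means are notational assumptions introduced here.

```latex
\[
\hat{y} = b_0 + b_1 x, \qquad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
\]
```

The slope b_1 is the "change in Y per unit change in X" mentioned above.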
Calculating Correlation Coefficients
While visual inspection of scatter plots provides valuable insights, calculating the Pearson correlation coefficient (r) offers a numerical measure of the strength of a linear relationship. This value ranges from -1 to +1: values close to -1 or +1 indicate a strong linear relationship, and values near zero indicate a weak or absent linear relationship (a curved pattern can still produce an r near zero). As a common rule of thumb, a correlation coefficient above 0.7 or below -0.7 indicates a strong relationship worth investigating further, although appropriate thresholds vary with the application and sample size.
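Written out for n paired observations, with the bar notation denoting sample means (a notational assumption introduced here), the Pearson coefficient is:

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

The numerator is positive when X and Y tend to be above or below their means together and negative when they move in opposite directions. The denominator rescales the result so that it always falls between -1 and +1.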
Creating Matrix Plots
When you need to examine relationships among multiple variables simultaneously, matrix plots (also called scatter plot matrices) display scatter plots for every possible pair of variables in a single visualization. This technique is particularly useful during exploratory data analysis when you are not sure which variables might be related.
Common Pitfalls and How to Avoid Them
While scatter plots are powerful tools, several common mistakes can lead to misinterpretation or missed insights:
Confusing Correlation with Causation
Perhaps the most critical error is assuming that correlation implies causation. Just because two variables show a relationship on a scatter plot does not necessarily mean that one causes the other. There might be a third variable influencing both, or the correlation might be coincidental. Always use scatter plots as part of a broader analytical approach that includes process knowledge and additional validation.
Ignoring Outliers
Outliers can significantly impact your interpretation and any subsequent statistical analysis. Rather than dismissing outliers as errors, investigate them thoroughly. They often represent special causes that provide valuable insights into process behavior or unusual conditions that need attention.
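A simple way to shortlist points for investigation is sketched below: fit a least-squares line, then flag observations whose residual is large relative to the typical scatter around the line. The function name, the sample data, and the threshold of 2 residual standard deviations are all assumptions chosen for illustration. Flagged points are candidates for review, not automatic deletions.

```python
import math

def flag_unusual_points(xs, ys, threshold=2.0):
    """Return indices of points far from the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Vertical distance of each point from the fitted line
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [i for i, e in enumerate(residuals) if abs(e) > threshold * s]

# Hypothetical data: Y is roughly 2 * X, except for one unusual reading
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 20, 10, 12, 14, 16]

flagged = flag_unusual_points(xs, ys)
for i in flagged:
    print(f"Investigate observation {i + 1}: x = {xs[i]}, y = {ys[i]}")
```

Because an extreme point also pulls the fitted line toward itself, this kind of screening is only a first pass. The scatter plot itself and process knowledge remain the main guides.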
Using Inappropriate Sample Sizes
Too few data points can lead to misleading patterns, while unnecessarily large datasets might obscure important relationships through overplotting. Aim for a sample size that balances statistical validity with practical considerations.
Failing to Consider Time
Standard scatter plots do not display temporal information. If your data was collected over time, patterns might exist that are not visible in a scatter plot. Consider supplementing your scatter plot analysis with time series plots or control charts.
Real-World Applications Across Industries
Scatter plots find applications across virtually every industry where process improvement and data analysis matter:
In healthcare, hospitals use scatter plots to examine relationships between patient wait times and staff levels, or between medication dosages and patient outcomes. In retail, analysts create scatter plots to understand how promotional spending relates to sales increases. In software development, teams plot the relationship between code complexity and bug rates. In education, administrators examine how study hours correlate with test scores.
These diverse applications demonstrate the universal value of scatter plots as a fundamental analytical tool in the Lean Six Sigma toolkit.
Integration with Other Analyse Phase Tools
Scatter plots work most effectively when combined with other analytical tools during the Analyse phase. For instance, you might use a fishbone diagram to identify potential root causes, then create scatter plots to test which of those potential causes actually correlate with your output variable. Process maps help you identify which process parameters to plot against quality metrics. Hypothesis tests can provide statistical validation of relationships you observe visually in scatter plots.
This integrated approach ensures comprehensive analysis that leads to accurate conclusions and effective improvement strategies.
Software Tools for Creating Scatter Plots
Modern practitioners have access to numerous software options for creating scatter plots, ranging from simple to sophisticated:
Spreadsheet applications like Microsoft Excel or Google Sheets offer basic scatter plot functionality that suffices for many applications. Statistical software packages such as Minitab, JMP, or R provide advanced features including automated correlation analysis, regression modeling, and matrix plots. Specialized quality management software often includes scatter plot capabilities alongside other Lean Six Sigma tools. Business intelligence platforms like Tableau or Power BI enable interactive scatter plots that stakeholders can explore dynamically.
The choice of software depends on your specific needs, budget, and the complexity of your analysis.
Best Practices for Presenting Scatter Plots
Creating scatter plots is only half the battle; presenting your findings effectively is equally important:
Always provide context by explaining what the variables represent and why their relationship matters. Use clear, descriptive titles that communicate the key finding at a glance. Choose colors and symbols thoughtfully, ensuring your plots are accessible to colorblind viewers when possible. Include reference lines or target zones when relevant to show specifications or goals. Accompany your scatter plots with brief written interpretations that guide viewers toward the important insights.
Remember that your audience may include stakeholders without technical backgrounds, so clarity and simplicity should guide your presentation choices.
Moving from Analysis to Action
The ultimate goal of creating scatter plots during the Analyse phase is not simply to understand relationships but to use that understanding to drive improvement. Once you have identified significant correlations between process inputs and outputs, you can develop targeted improvement strategies that