Survey Gamification? It’s About Good Survey Design
Thursday, November 3, 2011 | by Michaela Mora

At the root of survey gamification are good, sound survey design principles. That’s the main message from Reg Baker’s presentation at the MR Festival.
Baker follows the cognitive process (Tourangeau, Rips and Rasinski, 2000) involved in how respondents process information and survey questions and points out the opportunities to create engaging surveys. When faced with survey questions respondents go through different phases:
Survey tool providers are racing to create different question formats (e.g. sliders, heatmaps, etc.) to make the survey-taking experience more engaging and minimize abandonment rates. However, with the increase of surveys and DIY research done by inexperienced people, the quality of survey design has declined. Writing surveys looks easy, but it is not. Fun and cool question formats can’t compensate for ill-designed questions.
I have to agree with Baker that the greatest improvement needed now to engage respondents is in survey design.

Sometimes I’m asked to review surveys or analyze data collected via surveys developed by clients, and more often than not I find rating scales (aka Likert scales) of different sizes and directions within the same survey. When I ask why, I get answers such as “It” or “This is the one we have always used.”
It seems rating scales are often chosen based on preference or habit (e.g. legacy surveys), which is not surprising since there is no consensus on what rating scales work best. They all yield different results, which is disheartening in a way.
There has been a lot of research dedicated to this subject illustrating there is no simple answer to the question on which rating scales we should use.

Source: International Journal of Social Research Methodology, Vol. 13, No.1 Feb. 2010, 17-27 (Hartley and Betts)
This extensive body of research shows that different rating scales are bound to yield different results as we are mainly dealing with human perception. Rating scales mean different things to different people and the values, words, and order in which we present them have an impact on how they are interpreted. What to do?


Attitudinal questions are common in surveys. They are often asked using an agree-disagree rating question format. The challenge is always to create statements that capture important elements of the attitudes we are trying to measure. Ideally, if the budget allows it, we should do qualitative research to gather insights into such elements and how people think and talk about them.
Even with qualitative data available, writing good attitudinal statements is not an easy task. Here are some guidelines to facilitate the process:

Writing short surveys is an uphill battle with many clients. Whenever the word is out that a survey will be conducted, everybody close to the subject, be it the product team, senior management or operations, wants to add questions. The thought is, “since we are doing a survey let’s get as much as possible out of it.”
Unfortunately, the only thing you get out of very long surveys is bad quality data. Why?
NON-RESPONSE & ABANDONMENT
As the survey length increases, so does the non-response bias and abandonment rate. Simply said, respondents won’t stay too long answering questions. Many won’t even start if they know the survey length (It is a best practice to announce the length of the survey in the invitation).

For those who think they can get away with it by not announcing how long the survey will be, think again. Respondents can always figure out the length from the progress bar and will drop in the middle of the survey if they perceive it as too long (even if no progress bar is shown). High abandonment and non-response rates affect sample representativeness negatively.
In an experiment conducted by Galesic and Bosnjak (2003) to prove this point, 3,472 respondents were divided into 3 groups, each given an online survey of a different length (10, 20 and 30 minutes). The chart above shows how the number of respondents who started and completed the survey declined as the survey length increased.
Respondents who are willing to endure a long survey are at high risk of experiencing high burden and becoming “satisficers.”
Satisficing occurs when respondents select answer options without giving them much thought. They go for the most effortless mental activity that satisfies the question requirement, rather than working to find the optimal answers that best represent their opinion. Respondents may start selecting the first choice in every question, straight-lining in grid questions (selecting the same answer across all items) or simply selecting random choices without much consideration. This type of behavior renders the data worthless.
The same experiment by Galesic and Bosnjak was set to test the impact of survey length on data quality, which was measured with a variety of indicators including response times, item response rate, length of answers to open-ended questions, and variability of answers to questions in grids.
Of all the indicators, item response rate (defined as the percentage completed from all questions presented in a block) was the only one that seemed unaffected by survey length. However, it is unclear if the survey was programmed to force respondents to answer before going forward in the survey. For the other indicators, the results strongly suggest that survey length affects quality.

There are powerful reasons that push clients and force research vendors to launch long surveys. Budget, time constraints, and different agendas from internal groups are some of them. However, when surveys start getting too long, clients and research vendors should take a minute to think about the implications. After all if we get bad data, we have wasted the little time and money we started with.

Multiple-choice questions (check all that apply) are one of the most common question formats found in online surveys. However, there are a couple of problems with this type of question:
WHAT CAN WE DO ABOUT IT?
A solution to this problem is to ask multiple-choice questions as a series of forced yes/no answers for each of the question items. This format requires that respondents report a judgment on each of the items. Research has shown that forced yes/no questions encourage deeper processing and discourage satisficing response strategies, as measured by the time spent answering forced yes/no vs. check-all questions and the number of items marked affirmatively in each question format. Research by Smyth et al. (2003), comparing results from both types of formats in online surveys, has found that:
We can argue that the longer time spent answering the forced yes/no questions is a mechanical function of the fact that respondents are forced to give an answer for each item and spend extra time marking “no,” which is not required in the check-all question.
However, the positive correlation between time spent on answering the question and the number of options selected has also been shown to be an indicator of deeper processing and more thoughtful answers for the check-all formatted questions as well. Respondents who spend more time answering check-all questions mark significantly more answers than those who answer check-all questions in less time.
Another research result supporting the hypothesis of deep processing is that no significant differences have been found in the number of options marked affirmatively between respondents that take longer time answering yes/no questions and check-all questions.
The yes/no format for multiple-choice questions is not 100% foolproof, as some respondents may still show satisficing behavior by marking yes or no for all options or marking them randomly. In this case we need to put quality checks in place during programming that take into account the time spent on the question and any straightlining patterns.
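A quality check of this kind can be sketched in a few lines of Python. This is only an illustration: the `GridResponse` structure, the `flag_low_quality` function and the one-second-per-item threshold are all assumptions made for the example, not settings from any particular survey platform.

```python
from dataclasses import dataclass


@dataclass
class GridResponse:
    respondent_id: str
    answers: list[str]    # one "yes"/"no" answer per item, in display order
    seconds_spent: float  # time spent on the whole question


def flag_low_quality(response: GridResponse,
                     min_seconds_per_item: float = 1.0) -> list[str]:
    """Return the quality flags raised by one response (hypothetical rules)."""
    flags = []
    # Straightlining: the same answer was given for every item.
    if len(response.answers) > 1 and len(set(response.answers)) == 1:
        flags.append("straightlining")
    # Speeding: less time than the assumed minimum needed to read each item.
    if response.seconds_spent < min_seconds_per_item * len(response.answers):
        flags.append("speeding")
    return flags


# Made-up responses to a five-item forced yes/no question.
responses = [
    GridResponse("r1", ["yes", "no", "yes", "no", "no"], 14.2),
    GridResponse("r2", ["yes", "yes", "yes", "yes", "yes"], 2.1),
    GridResponse("r3", ["no", "no", "no", "no", "no"], 11.0),
]
flagged = {r.respondent_id: flag_low_quality(r) for r in responses}
```

A flag is a reason to review a response, not proof of satisficing. A respondent who honestly uses none of the listed items will also answer “no” throughout, which is why combining the pattern with time spent is more informative than either check alone.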
An issue that we also need to manage is the fact that sometimes respondents are undecided or think an option doesn’t apply to them. In this case it would be wrong to force them to give a yes or no answer. The best remedies against this problem are respondent screening and survey skips that would avoid showing options that don’t apply. In cases in which there is still room for this problem, I recommend adding a third “Don’t Know/Not Applicable” option.
IMPLICATIONS FOR MIXED DATA COLLECTION MODES
The forced yes/no format for multiple-choice questions is commonly used in phone surveys since it is impractical to read all the options to respondents and expect them to remember them all to answer the question. Often, when mixed data collection modes are used (phone/online, phone/paper), the yes/no and check-all question formats are treated as equivalent, assuming they are answered the same way. Research suggests that this would be a mistake.
Experiments carried out by Smyth et al. (2008) with phone and online surveys using both question formats have shown that the forced yes/no format yields consistently more options marked affirmatively than check-all formatted questions in online self-administered and phone-administered surveys. This supports the idea that results from both question formats are not comparable and shouldn’t be treated interchangeably.
KEY TAKEAWAYS

There are many things to consider if we want to write surveys that gather high quality data, including data collection method, respondent effort requested, question wording, order, format, structure, visual layout, behaviors to be measured, and accuracy of the elicited information, among others. Although all these issues are important, at the end of the day, what we want is to create surveys that yield results that are valid and reliable.
Validity and reliability are often discussed in the field of psychometrics, but not so much in market research, although it is assumed they are present.
Validity is concerned with the accuracy of our measurement, and it is often discussed in the context of sample representativeness. However, validity is also affected by survey design since it also depends on asking questions that measure what we are supposed to be measuring.
Most surveys often have what is called face validity, which is a matter of appearances. The questions seem like a reasonable way to obtain the information we are looking for, but are they really? There are other types of validity survey writers should strive for:
Reliability, on the other hand, is concerned with the consistency of our measurement, that is, the degree to which the questions used in a survey elicit the same type of information each time they are used under the same conditions. This is particularly important in satisfaction and brand tracking studies, as changes in question wording and structure are likely to elicit different responses.
Reliability is also related to internal consistency, which refers to the degree to which different questions or statements measure the same characteristic. A practical application of this concept can be found in marketing segmentation studies that try to capture psychographics and construct behavioral or satisfaction segments by asking respondents to rate a list of statements using different rating scales (e.g. agreement/disagreement; likes/dislikes; excellent/poor, etc.). For example, if we want to identify “lovers of styling products,” the statements used to describe such consumers should provide a consistent description of this group. This can be tested by using correlations, split sample comparisons or methods such as Cronbach’s Alpha.
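For readers who want to see what a Cronbach’s Alpha check involves, here is a minimal sketch using only the Python standard library. It applies the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of respondents’ total scores), where k is the number of statements. The function name and the ratings are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(ratings: list[list[float]]) -> float:
    """ratings[r][i] is respondent r's rating of statement i."""
    k = len(ratings[0])  # number of statements
    # Sample variance of each statement across respondents.
    item_variances = [variance([row[i] for row in ratings]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up data: five respondents rating three statements on a 1-5 scale.
ratings = [
    [5, 4, 5],
    [4, 4, 4],
    [2, 1, 2],
    [3, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(ratings)  # roughly 0.95 for this made-up data
```

Values closer to 1 indicate that the statements move together across respondents. All statements must be coded in the same direction before the calculation, otherwise the result is misleading.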
Validity and reliability are not always aligned. Reliability is needed, but not sufficient to establish validity. We can get high reliability and low validity. This would happen when the wrong questions are asked over and over again, consistently yielding bad information. Also, if the results show large variation, they may be valid, but not reliable. So, don’t forget to think about reliability and validity when writing your next survey and strive for reliable and valid results.

MaxDiff is a superior technique for the research of preferences or importance. In our presentation at the 2010 AMA Market Research conference in Atlanta, my colleague Kathryn Korostoff from Research Rockstar and I made the case for MaxDiff and discussed its advantages over rating, ranking and constant sum questions.
Rating questions are susceptible to:
Ranking questions’ limitations include:
Constant sum questions’ weaknesses include:
The problems with each of these question types, particularly with rating questions, have led to an increased interest in the use of Maximum Difference Scaling, or MaxDiff, as it is commonly called.
MaxDiff is a trade-off analysis technique that allows us to do multiple pairwise comparisons in an effective way by asking respondents to select the most and the least preferred or important items out of a list we want to test, in search of the greatest differences among items.
MAXDIFF BENEFITS
THE PROCESS
In order to implement MaxDiff we need to:
The standard output of MaxDiff analysis is usually a ranking of the items tested based on rescaled utilities, but these can also be used to conduct further multivariate analysis such as correlations analysis, multiple regression, t-testing, TURF analysis, cluster analysis, latent class segmentation, etc.
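To make the idea of a MaxDiff ranking concrete, here is a rough sketch of a simple counting analysis in Python. Note that this is a deliberately simplified stand-in: it scores each item as (times picked as best − times picked as worst) / times shown, and does not estimate the rescaled utilities mentioned above, which require a statistical model. The function name, the items and the answers are all made up for illustration.

```python
from collections import Counter


def maxdiff_count_scores(tasks):
    """tasks: iterable of (items_shown, best_item, worst_item) tuples.

    Returns {item: (best picks - worst picks) / times shown}.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items_shown, best_item, worst_item in tasks:
        shown.update(items_shown)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Made-up answers: each task shows four items and records the most
# and the least important one.
tasks = [
    (["price", "quality", "brand", "design"], "quality", "brand"),
    (["price", "service", "brand", "design"], "price", "brand"),
    (["quality", "service", "design", "price"], "quality", "design"),
]
scores = maxdiff_count_scores(tasks)
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking: quality, price, service, design, brand
```

Scores range from -1 (always picked as worst when shown) to +1 (always picked as best). Like any MaxDiff output, they are relative to the list of items tested.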
MAXDIFF APPLICATIONS
MaxDiff can be used to study preferences for and importance of a number of things including:

MaxDiff is not perfect. It usually takes longer for the respondent to take, and depending on your research goals, the relative measure it provides may not be what you want. MaxDiff helps you prioritize within a given list of items, but it doesn’t tell you if all are preferred/important or not from an absolute perspective. However, the latter is less of a problem as we can include additional questions, which allow us to calibrate the MaxDiff ranking to “absolute” levels of importance or preference.
Nonetheless, next time you need to measure preferences or importance, consider using MaxDiff instead of traditional approaches such as rating, ranking or constant sum questions. You will gain in data quality, greater discrimination and the ability to provide better insights to support business decisions.
Stay tuned for the upcoming case study of how MaxDiff can be used in market segmentation.
To learn more about our consumer data service visit Consumer Shopping Behavior Insights. To request consumer shopping behavior data and insights don’t hesitate to contact us.
As published on July 6, 2010 in the July 2010 issue of the Quirk’s Marketing Research Review.

The advent of user-friendly online survey tools in recent years has created the illusion that anybody can write a survey questionnaire. After all, how hard can it be? It’s like asking questions in a conversation, many think. However, there are many methodological issues to consider when creating a questionnaire if you want to gather high-quality data in a survey. The following are 10 issues that arise in survey design.
Some questions may elicit different answers if asked in an online survey, a telephone interview, a paper survey or a face-to-face interview. While words in phone surveys or in-person interviews are given more importance because of the conversational format, visual design elements have a bigger impact in how questions are read and interpreted in online surveys. Be aware of the types of questions that are a good fit for online surveys.
There are questions that put a heavier burden on the respondent’s working memory and comprehension or are likely to elicit higher non-response if asked in different data collection modes. Experience tells us that asking a ranking question with 10 items over the phone can overwhelm respondents. In online surveys, rating questions in matrix format with a large number of items increases fatigue and boredom and often leads respondents to adopt a “satisficing” behavior. Satisficing occurs when respondents select the same scale-point to rate all items without giving them too much thought. They go for the most effortless mental activity trying to satisfy the question requirement, rather than work on finding the optimal answers that best represent their opinion.
Formulating a question with the right wording so it accurately reflects the issue of interest is one of the hardest parts in writing questionnaires. You may have seen political polls getting different answers depending on how a question is crafted. Data errors can creep into a survey if we use unfamiliar, complex or technically-inaccurate words; ask more than one question at a time; use incomplete sentences; use abstract or vague concepts; make the questions too wordy; or ask questions without a clear task.
Another issue related to question wording is the risk of introducing bias by leading the respondent in a particular direction. I recently received a mail survey sponsored by the Republican Party to represent the opinion of voters in my congressional district and one of the questions was:
“Do you think the record trillion-dollar federal deficit the Democrats are creating with their out-of-control spending is going to have disastrous consequences for our nation?”
Could this question be more biased? The use of adjectives such as “record,” “out-of-control” and “disastrous” makes it really clear what the expected answer is and what the intentions of the sponsor are.
Questions should follow a logical flow. Order inconsistencies can confuse respondents and bias the results. For instance if you are measuring brand awareness and ask respondents to recognize brands they are familiar with before asking which brands first come to mind, you are rendering the results from the latter question worthless since respondents can’t avoid thinking of brands they just saw in the first question. This seems basic, but it happens.
Questions can be closed-ended or open-ended. Closed-ended questions provide answer choices, while open-ended questions ask respondents to answer in their own words. Each type of question serves different research objectives and has its own limitations. The key issues here are related to the level of detail and information richness we need, our previous knowledge about the topic, and whether to influence respondents’ answers. For example, for closed-ended questions we need to decide what the answer choices should be and in which order they should appear. This requires we know enough about the topic to provide answer options that capture the information accurately.
Some questions yield more accurate information than others. Respondents can answer questions about their gender and age accurately, but when it comes to attitudes and opinions on a particular issue, many may not have a clear answer. Overall, attitudes and opinion questions should be worded in a way that best reflects how respondents think and talk about a particular issue so that we can tease out information that is difficult for the respondent to articulate. However, some questions need to be skipped when they don’t apply to the respondents’ experience or the issue is so irrelevant to the respondent that s/he doesn’t have a formed opinion about it. In the case in which attitude statements appear grouped in a matrix format and some may not apply to a respondent (e.g., a customer satisfaction survey after a phone call to customer support), it is necessary to include a “Not sure/Don’t know/Not applicable” option to avoid introducing measurement error in the data.
For instance, the other day I received an online customer satisfaction survey from BlackBerry after a call I made to its support desk. The survey had a question in which I was asked to rate the representative who took my call on different aspects. One of them was “Timely Updates: Regular status updates were provided regarding your service request.” I wouldn’t know how to answer this, since the issue I called for didn’t require regular updates. Luckily, they had a “Not applicable” option, otherwise I would have been forced to lie, and one side of the scale would be as good as the other.
People tend to have less-precise memories of mundane behaviors they engage in on a regular basis, and usually they do not categorize events by periods of time (e.g., week, month and year). We need to consider appropriate reference periods for the type of behavior we want to measure. Asking “Have you purchased any piece of clothing in the last seven days?” will yield a more accurate behavior measure than asking “Have you purchased any piece of clothing in the last six months?”
Measured behavior should be relevant to the respondent and capture his or her potential state of mind. This is valid particularly when we use rating questions and have to decide whether to include a neutral mid-point. A lot of research has been conducted in this realm, particularly by psychologists concerned with scale development, but no definitive answer has been found and the debate continues. Some studies find support for excluding it while others for including it depending on the subject, audience and type of question.
Those against a neutral point argue that by including it we give respondents an easy way to avoid taking a position on a particular issue. There is also the argument that equates including a neutral point to wasting research dollars, since this information would not be of much value or at worst it would distort the results. This camp advocates for avoiding the use of a neutral point and forcing respondents to tell us on which side of the issue they are.
However, consumers make decisions all day long and many times find themselves idling in neutral. A neutral point can reflect any of these scenarios: we feel ambivalent about the issue and could go either way; we don’t have an opinion about the issue due to lack of knowledge or experience; we never developed an opinion about the issue because we find it irrelevant; we don’t want to give our real opinion if it is not considered socially desirable; or we don’t remember a particular experience related to the issue that is being rated.
By forcing respondents to take a stand when they don’t have a formed opinion about something, we introduce measurement error in the data since we are not capturing a plausible psychological scenario in which respondents may find themselves. This is yet another reason to include a “Not sure/Don’t know/Not applicable” option in addition to a neutral point.
Questions have different parts that must work in harmony to capture high-quality data. These are the question stem (e.g. what is your age?), additional instructions (e.g. select one answer) and response options, if any (e.g. Under 18, 19 to 24, 25 +). The wrong combination can leave respondents baffled about how to answer a question. Consider the example below:
Overlapping answer options
What is your household income? Select one answer.
So, which answer should I choose if my household income is $50,000? Is it option 2 or option 3?
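Overlaps like this are easy to catch before a survey goes live. The sketch below, in Python, checks a list of numeric answer brackets for shared values. The bracket labels and bounds are hypothetical, chosen only to reproduce the $50,000 problem described here, and the `find_overlaps` helper is an illustration rather than a feature of any survey tool.

```python
from itertools import combinations


def find_overlaps(brackets):
    """brackets: list of (label, low, high) with inclusive bounds.

    Returns the pairs of labels whose ranges share at least one value.
    """
    return [
        (a[0], b[0])
        for a, b in combinations(brackets, 2)
        if a[1] <= b[2] and b[1] <= a[2]
    ]


# Hypothetical answer options in which $50,000 fits two brackets.
overlapping = [
    ("Under $25,000", 0, 24_999),
    ("$25,000 to $50,000", 25_000, 50_000),
    ("$50,000 to $75,000", 50_000, 75_000),
    ("Over $75,000", 75_001, float("inf")),
]
# The same options rewritten so that every income fits exactly one bracket.
fixed = [
    ("Under $25,000", 0, 24_999),
    ("$25,000 to $49,999", 25_000, 49_999),
    ("$50,000 to $74,999", 50_000, 74_999),
    ("$75,000 or more", 75_000, float("inf")),
]
problems = find_overlaps(overlapping)
no_problems = find_overlaps(fixed)
```

Here `problems` holds the one offending pair and `no_problems` is empty. A similar check for gaps between brackets would be a natural companion.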
Conflict in meaning between different parts of the question
Please indicate the products you use most often. Select all that apply.
Using design elements in an inconsistent way can increase the burden put on the respondent in trying to understand the meaning of what is asked. For example, encountering different font sizes and colors across questions forces the respondent to relearn their meaning every time they are used.
Also, presenting scales with different directions (positive to negative or vice versa) in rating questions within the same survey increases measurement error as respondents often assume all rating questions have the same scale direction even when the instructions explain the meaning of the end points of the scale. For instance, if a preference question using a 1-7 scale where 1 means “the most preferred” is followed by an importance question, also using a 1-7 scale, but where 1 means “the least important,” respondents who are not paying attention to the instructions (which is quite common) are likely to assume that the 1 in the importance question means “the most important.” I have seen many examples of this problem, when respondents are asked a follow-up question conditioned on their previous answers and then they realize their mistake and tell us they actually meant to say the opposite.
Based on the research objectives, both the type of information requested and the question format are important for the type of analysis we plan to perform once the data is collected. If you want to develop a customer satisfaction model using linear regression analysis and the dependent variable is an open-ended question, you can forget about modeling anything. This seems obvious, but I have seen non-researchers writing questionnaires without thinking about how they will analyze the data and then come to me asking for analyses that are not appropriate for the data collected.
There is also the question of whether we want to replicate the results, track certain events or just run a one-time ad hoc analysis. If the goal is to track certain metrics, time and care should be dedicated to crafting tracking questions, as slight changes in wording can change the meaning of a question and thus its results.
ON YOUR WAY
If you take each of these aspects of survey writing into consideration, you will be on your way to creating surveys that produce valid data and can support with confidence strategic and tactical decisions for your business.