1.2 Studying language scientifically

We said that linguistics is the scientific study of human language and discussed what we mean by “human language”. Let us now explain what we mean by “scientific study”. When we say that linguistics is a science, that doesn’t mean you need a lab coat and a microscope to do linguistics, although there are branches of linguistic research that use laboratory instruments, such as phonetics (see Chapter 3) or psycholinguistics. What it means is that we think about language using the scientific way of thinking.

The scientific way of thinking about language involves systematic, empirical study. The word empirical means that we base our ideas about language on data that we gather by observing how people use their language in natural settings, or by giving them linguistic tasks in a laboratory and recording their responses. In this way, linguistics is no different from other sciences — entomologists observe the life cycles and habitats of insects, chemists observe how substances interact, linguists observe how people use language. Just like entomologists and chemists, linguists aim for an accurate description of the phenomenon they’re studying. And like other scientists, linguists strive to make observations that are not value judgments. If an entomologist observes that a certain species of beetle eats leaves, she’s not going to judge that the beetles are eating wrong, and tell them that they’d be more successful in life if only they ate the same thing as ants. The same is true of linguists — we do not go around telling people how they should or shouldn’t use language. Or at least, that is what would be the case in an ideal world. Of course, like all scientists, and like all humans, linguists have biases that often prevent us from reaching this ideal — unlike entomologists or chemists, we are not impartial observers of organisms or substances different from ourselves, but we are part of the very thing we study — users of one or more languages, members of one or more language communities with all the cultural biases that other members of these communities have. Thus, it is more difficult for us to adhere to the scientific ideal, and we must try harder to do so.

What makes things even more difficult is that language communities tend to have their own traditions of thinking and talking about language, and those traditions can be at odds with the scientific approach to language. In many language communities, there is a prescriptive tradition of telling people how they should or should not use language. Such an approach seems natural to members of such communities, in part because it is woven into our education systems, and they may find the purely descriptive approach that linguists strive for irritating.

To illustrate the difference between the two approaches, take the way that plurals are formed in English. A first approximation of a description of English plurals could be the following:

Adding -s to a noun allows it to refer to more than one instance of what the noun refers to (for example, apple/apples, book/books, dog/dogs, virus/viruses or formula/formulas).

This is not a complete description yet, but it accounts for a large part of the linguistic behavior of English speakers forming plurals.

A prescriptive statement, in contrast, would look like this:

Because the word virus is derived from Latin, you should pluralize it as viri, not viruses.

First, note that this statement is wrong even in the part where it talks about Latin — there is no attested plural form of virus in the Latin texts we have, but since it is a neuter noun, the expected Latin form would be vira. Second, this is completely irrelevant, since virus, when used by a speaker of English, is an English word, not a Latin word, and speakers of English use the plural form viruses. The only speakers who say viri are people who have let themselves be convinced by prescriptivists against their own better judgment.

Of course, there can be situations where different speakers do different things. Take the word formula — it was included in the descriptive statement above as an example of a word whose plural is formed by the addition of an -s, and in fact, many speakers of English use this plural. However, others use the plural formulae, which is, indeed, the Latin plural. What do we do as linguists in this situation? Do we tell people that formulae is correct because that is the Latin form? Or do we tell people that formulas is correct because that is how plurals are normally formed in English? The answer is, we do neither. It is not our job to tell anyone how to use their language, but to observe and describe how they use it naturally. Thus, we would first state that some speakers use formulas and some use formulae, and we would then try to determine if there are reasons for this variation — does it depend on the region where a speaker is from, does it depend on the kind of setting they are in, etc.

So when we’re doing linguistics, our goal is to make descriptive, empirical observations of language. In doing so, we have two problems, a practical one and a theoretical one. The practical problem concerns the availability of data. If we are interested in a particular question, such as what the plural of the English word formula is, it would be very inconvenient if we had to walk around and wait for speakers to speak about more than one formula, since it is a rare word even in the singular form. Linguists get around this problem by collecting vast amounts of natural language data in electronic form, which we can then search for words (or other linguistic phenomena) using a computer. We could also perform a simple experiment and ask a number of speakers. This is easy enough for a simple question, but it becomes more and more difficult the larger and more complex our questions become. Linguists do rely on experiments quite a bit, but such experiments are time-consuming and expensive.
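To give a rough idea of what searching electronic language data can look like, here is a minimal sketch in Python. The three-sentence "corpus" is invented purely for illustration and says nothing about real usage, and the function name count_variants is our own choice, not part of any standard tool.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration only. A real corpus would contain
# millions of words of naturally occurring language.
TOY_CORPUS = [
    "The chemist wrote two formulas on the board.",
    "These formulae are derived in the appendix.",
    "Both formulas give the same result.",
]

def count_variants(texts, variants):
    """Count how often each variant occurs as a whole word, ignoring case."""
    counts = Counter({variant: 0 for variant in variants})
    for text in texts:
        for variant in variants:
            # \b marks a word boundary, so "formulas" does not match "formulae".
            pattern = r"\b" + re.escape(variant) + r"\b"
            counts[variant] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

counts = count_variants(TOY_CORPUS, ["formulas", "formulae"])
print(counts)
```

Note that the program only counts what is in the texts; it makes no claim about which form is "correct", which is exactly the descriptive stance discussed above.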

The theoretical problem concerns the interpretation of data: as linguists, we are interested not just in the specific linguistic behaviors that people display in a specific situation, but in the subconscious linguistic knowledge that guides these behaviors. That knowledge cannot be observed directly. We cannot just cut open language users’ heads in the way our entomologist friend might dissect a beetle, and even if we could, we would not find any linguistic knowledge there. Thus, we have to deduce a model of the subconscious knowledge of language users from their behavior (in the wild or in the laboratory). This is a very difficult task indeed.

Metalinguistic knowledge as a source of empirical data

One solution that many linguists propose to both of the problems mentioned above is to access our own metalinguistic awareness as fluent speakers or signers of a language. As mentioned above, as linguists we are part of what we study, so why not turn this into an advantage? We already have the subconscious knowledge that we are after, so the argument goes, and all we have to do is access that knowledge.

This is an attractive idea, and to some extent we can make it work. Here’s an example of accessing your metalinguistic awareness. Say you want to create a new English word for a character in a game. Are you going to call your cute little creature a blifter or a lbitfer? Neither of those forms exists in English, but they both use sounds that are part of the sound system of English. Yet, you probably have a strong feeling that blifter is an okay name for your new creature, while lbitfer is a pretty terrible name. Notice that your sense that lbitfer is wrong is not based on prescriptive ideas — it’s not that it sounds rude or you’ll get in trouble for combining those sounds that way. It is based on an intuitive knowledge that it just … can’t happen. You’ve made a descriptive observation that lbitfer is not a possible word in English. From that observation, we can conclude that lbitfer violates some part of the subconscious knowledge of the English language that fluent language users have.

Since many linguists use the term mental grammar to describe that subconscious knowledge (including the knowledge not just of what we colloquially call grammar, but also of the sounds, words and bits of words), we could say that the word blifter is grammatical and the word lbitfer is ungrammatical in English. An ungrammatical word or phrase or sentence is something that just can’t exist in a particular language: the mental grammar of that language does not contain any rules or representations that would allow language users to produce it. Thus, grammaticality isn’t about what actually exists in a language; it’s about what could exist. In this example, neither blifter nor lbitfer exists in English, but although they have the same sounds in them, blifter could be an English word and lbitfer couldn’t.
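The idea that a mental grammar allows some combinations and rules out others can be made concrete with a toy model. The sketch below rests on strong simplifying assumptions: it uses spelling as a stand-in for sounds, and the set SAMPLE_ONSETS is a small, hand-picked and incomplete sample of consonant sequences that can begin an English word. It is meant only to illustrate the logic of "possible" versus "impossible", not to describe the sound system of English.

```python
# Hand-picked, incomplete sample of consonant sequences that can start an
# English word (an assumption for illustration, written in ordinary spelling).
SAMPLE_ONSETS = {
    "bl", "br", "cl", "cr", "dr", "fl", "fr", "gl", "gr",
    "pl", "pr", "sl", "sm", "sn", "sp", "st", "tr",
    "spl", "spr", "str", "scr",
}
VOWEL_LETTERS = set("aeiou")

def initial_consonants(word):
    """Return the consonant letters that precede the first vowel letter."""
    onset = ""
    for letter in word.lower():
        if letter in VOWEL_LETTERS:
            break
        onset += letter
    return onset

def could_start_english_word(word):
    """Toy check: zero or one initial consonant always passes;
    longer sequences pass only if they are in SAMPLE_ONSETS."""
    onset = initial_consonants(word)
    return len(onset) <= 1 or onset in SAMPLE_ONSETS

for candidate in ["blifter", "lbitfer"]:
    print(candidate, could_start_english_word(candidate))
```

Under these assumptions, blifter passes the check and lbitfer does not. Neither word is stored anywhere in the program, which mirrors the point that grammaticality is about what could exist rather than what does exist.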

It’s often useful to compare similar words, phrases or sentences to try to access our metalinguistic awareness. Let’s look at another example of observing what’s possible, this time from what we would usually call grammar. Here are two similar sentences:

(1a)
Sam compared the forged painting with the original.
(1b)
Sam compared the forged painting and the original.

As fluent language users of English, we intuitively know that both of these are possible sentences in English (they are both grammatical).

Now let us turn these sentences into questions — something that we, as fluent language users of English, can do without giving it any thought at all:

(2a)
Did Sam compare the forged painting with the original?
(2b)
Did Sam compare the forged painting and the original?

Observing those two questions, we can see that, again, both (2a) and (2b) are acceptable in English.

Now let’s try a different kind of question:

(3a)
What did Sam compare the forged painting with?
(3b)
*What did Sam compare the forged painting and?

Comparing these two sentences, we intuitively know that (3a) is possible (grammatical), but (3b) is not. Linguists normally use an asterisk (*) in front of a sentence to indicate that they have determined it to be ungrammatical based on their own metalinguistic awareness. Inventing a word, phrase or sentence and then determining whether it is grammatical based on our own metalinguistic awareness is often referred to as producing grammaticality judgments or acceptability judgments.

Some linguists treat such grammaticality judgments as empirical data. They would state that the two similar sentences discussed here are both possible as declarative statements (1a, b) and as yes-no questions (2a, b), but when we try to make a wh-question out of them, the result is acceptable for the first one (3a) but not for the second one (3b). Having made that observation, they would now try to figure out what’s going on in the mental grammar that can account for this observation. Why is (3a) grammatical but (3b) isn’t?

While this procedure works in very simple cases like the one presented here, we should use it extremely sparingly. It should be used only in cases where our judgments are very clear and consistent, and where other users of the language share them without hesitation. This is rarely the case once we look at anything but the simplest phenomena. And even then, we should remain skeptical — it is very easy to convince yourself and others that something that you want to be true is actually true. Overall, you should not regard grammaticality judgments as empirical data, but as a short-cut that allows us, in very restricted situations, to skip the process of collecting actual empirical data.

Proper sources of empirical data

Figure 1.4.1. Hoodie (image: grey zip-up hoodie on white background).

Even in simple cases, you might not want to rely on your own grammaticality judgments, and, of course, you cannot rely on your own judgments if you are dealing with a language, or a language variety, that you do not speak fluently. Instead, you can (or even have to) use a survey — a questionnaire that you distribute on paper or as an online form — to gather grammaticality judgments.
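Once survey responses are in, a first step is usually to summarize them per sentence. The following sketch assumes a rating scale from 1 (impossible) to 5 (fully acceptable); both the scale and the ratings are invented for illustration and are not real survey results.

```python
from statistics import mean

# Invented ratings on a 1-5 scale (5 = fully acceptable), one per respondent.
# These numbers are made up for illustration; they are not real survey data.
TOY_RATINGS = {
    "What did Sam compare the forged painting with?": [5, 5, 4, 5],
    "What did Sam compare the forged painting and?": [1, 1, 2, 1],
}

def summarize(ratings):
    """Map each sentence to (mean rating, range between highest and lowest)."""
    return {
        sentence: (mean(values), max(values) - min(values))
        for sentence, values in ratings.items()
    }

summary = summarize(TOY_RATINGS)
for sentence, (average, spread) in summary.items():
    print(f"{average:.2f} (range {spread}): {sentence}")
```

The range is included because a judgment is only useful if respondents agree; a sentence with a middling average and a wide range would call for closer investigation.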

We can also use surveys for other purposes, for example, for learning about regional language variation. We can elicit the words that people use for particular items in different places. From survey data we know that some people call the item in Figure 1.4.1 a sweatshirt, other people call it a hoodie, and people in Saskatchewan call it a bunny hug. If you’re studying regional and social variation you might also gather data using in-person interviews, in which you could ask questions like, “Does the ‘u’ in student sound like the ‘oo’ in too or the ‘u’ in use?”.

As mentioned above, linguists also use huge collections of natural language to make language observations. Such a collection is called a corpus. Corpora may contain a specific type of text (for example, fiction, academic texts, newspapers, social media or recorded conversations), or a mixture of different types of text meant to represent the language as a whole. There are many preexisting corpora that are routinely used in linguistic research, but the internet and modern computing have made it very easy to collect your own corpora for specific questions and to annotate and search these corpora using a variety of software tools. There are also software tools for other types of observational linguistic research, for example, for annotating and analyzing audio and video recordings of speakers and signers.

Corpora also allow us to study variation across regions, and they are special in that, for languages that have a written record, they allow you to study variation across time retrospectively: you can collect texts published at different points in the past to study earlier stages of a language, or change across different stages. You can also study differences between different types of text. For example, you could test the hypothesis that formulae was the original plural form in English and that speakers began to replace it with formulas as knowledge of Latin became less widespread. You would find that this is not true: in British English, the two plural forms were roughly equally frequent in the 18th and 19th centuries, formulae became much more frequent in the 20th century, and formulas took the lead in the 21st century. In American English, formulas was always the vastly more frequent form and remained so throughout the 20th century; thus, at least for a while, the two plural forms could be considered a dialectal difference. This difference holds across all text types, but in American English, formulae is found around a third of the time in academic writing and prose fiction, and almost never in newspapers, magazines or spoken language. This suggests that, in American English, there was an influence of register (with formulae existing as an alternative to formulas in formal registers only).
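Comparisons like these rest on a simple calculation: for each group of texts (a time period, a variety, or a text type), what share of all plural forms is formulae? The sketch below shows that calculation on a handful of invented observations. The data, the group labels and the function name variant_share are assumptions made for illustration; they are not the corpus results reported above.

```python
from collections import defaultdict

# Invented observations of the form (text type, plural form found).
# Not real corpus data; a real study would have thousands of such hits.
TOY_HITS = [
    ("academic", "formulae"), ("academic", "formulas"), ("academic", "formulas"),
    ("newspaper", "formulas"), ("newspaper", "formulas"),
]

def variant_share(hits, variant):
    """For each group, return the fraction of hits that use the given variant."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for group, form in hits:
        totals[group] += 1
        if form == variant:
            matches[group] += 1
    return {group: matches[group] / totals[group] for group in totals}

shares = variant_share(TOY_HITS, "formulae")
for group, share in shares.items():
    print(f"{group}: {share:.0%} formulae")
```

The same function works for time periods or regional varieties if the first element of each pair is a decade or a variety label instead of a text type.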

If we are interested in the mental representation of language, we can also draw on techniques from behavioural psychology and conduct experiments. You might measure language users’ reaction times and reading times for words and sentences, or ask participants to listen to words that are mixed with white noise. Some experiments use eye-tracking to measure people’s eye movements while reading a text, watching a signer, or listening to a speaker. It’s even possible to use neural imaging techniques like electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) to observe brain activity during language processing.

When you’re starting out in linguistics, it’s often really exciting to use the scientific method to think about grammar, as you start to see that grammar is not just a set of arbitrary rules to memorize so you sound “proper”. Even if we’re not peering through a microscope wearing a lab coat, the tools of language science allow us to make systematic observations of how humans use language. And we can interpret those observations to draw conclusions about the human mind.

 

CC-BY-NC-SA 4.0, Adapted from Anderson, Catherine, Bronwyn Bjorkman, Derek Denis, Julianne Doner, Margaret Grant, Nathan Sanders, and Ai Taniguchi, Essentials of Linguistics. 2nd ed., with rewriting and extensions by Anatol Stefanowitsch.