Are Google's linguistic prosthesis biased towards commercially more interesting expressions? A preliminary study on the linguistic effects of autocompletion algorithms
July 17, 2013, 08:30 | Short Paper, Embassy Regents F
Commodification of words
Displaying relevant ads next to non-paid relevant search results is part of Google's original, highly successful business model (Kaplan, 2011). Advertisers bid on certain keywords they want their ad associated with and pay only if Google displays [1] their ad. Which ads are displayed thus does not solely depend on the query keywords, but also on the bidding price advertisers have been willing to pay for the association of their ads to these keywords, as well as the quality score Google has attributed to their ad (Google, 2012a; Kaplan 2011; Lee 2011, p. 439). Advertising accounts for 97% of Google's revenues, which represented about 3 billion dollars per months in 2012/2011 (Singel 2011). By making advertisers bid on certain keywords to advertisers, Google has commodified words (Lee 2011).
The value of keywords change over time. Keywords are traded within Google's Second Price Auction (Edelman et al. 2007). Google's Diane Tang, creator of the Keyword Pricing Index, illustrates the different trading values of various keywords with the following examples: there are generally “very competitive keywords, like 'flowers' and 'hotels'”, whereas other keywords may cost generally little or — such as “snowboarding”, more expensive in winter — vary seasonally (Levy 2009). As a result, there exist different bidding strategies for advertisers (cf. D'Avanzo et al. 2011, p. 143-147 for a discussion of the most common strategies). However, Lee (2011) points out that — although online marketing literature about possible and recommended bidding strategies is abundant — “none of the studies found in advertising journals adopted a critical perspective” (p. 438).
Google-ese, Googlais and Googlisch
The fact that words have become a commodity with different monetary values and can be “bought” from Google implies that there is commodified language, which can be represented in a lexicon containing all the words and expressions which are actually “bought”. The lexicon of the commodified derivate of the English language is Google-ese. [2] (Analogical to the English language, Googlais is Google's commodified derivate of the French language, Googlisch its German equivalent etc.; other ad-selling search engines with potentially different algorithms are associated with different lexica, leading to e.g. Bingese, Bingais and Bingisch for Bing.) Google-ese therefore consists of a certain proportion of the roughly 500''000 dictionary entries (McCrum et al. 2001), some non-dictionary English words like names and places, foreign language expressions and “non-words” such as acronyms, misspelled and mistyped words etc.
Search engine biases
The very existence of Google-ese, Googlais, Googlisch and the like — i.e. specific keywords bought by advertisers and marketers — account for the company's financial success. Therefore, the “importance of pleasing the advertisers and marketers who support Google and other search engines can hardly be underestimated”, underlines Hinman (2008, p. 70). There is “potential for abuse” (p. 73 ff.), notably in the form of “subtle biases” when it comes to search results and search in general (p. 71). Since commodified language is the baseline for a search engine's revenue, there is reason to explore potential biases when it comes to the treatment of words.
Many scholars from different fields have demonstrated the existence of search engine bias empirically (e.g. Edelman 2011), analytically and (e.g. Diaz 2008) conceptually (e.g. Hinman 2008), overall adopting a critical viewpoint (cf. also Lawrence 2009, Zimmer 2009 p.7, Pariser 2011). Studies addressing search engine bias refer mostly to issues of information access, i.e. search engines as gatekeepers (cf. also Gasser 2006), as well as issues of knowledge shaping (Grimmelman 2009, Hinman 2008). But although that there is general acknowledgement of search engines' impact on access to and classification of knowledge, researchers agree that there has been little to no research focus on search engines' impact on society (Hargittai 2007, p.769; Lewandowski 2012, p.5; Spink and Zimmer 2008, p.344; Zimmer 2009, p. 516-517), let alone search engine's impact on language itself.
Linguistic Prosthesis
Certain functions of Google search are visibly impacting language by transforming the initial search query: “related searches”, for instance, associates our initial keyword to other keywords we might not have thought of; “Did you mean” suggests alternatives to our initial keyword, which was either not correctly spelled or not popular (Google 2012b); auto-completion suggests ways of finishing our initial keywords on our behalf while we are typing according to “purely algorithmic factors (including popularity of search terms)” (Google 2012c). These functions act as a mediation between our thoughts (i.e. the initially intended query) and its expression. We suggest to call them linguistic prosthesis (Kaplan 2011). If search engine biases manifest in linguistic prosthesis, the expression of our thoughts is seamlessly transformed by Google's algorithms.
Modeling of linguistic prosthesis and corresponding commercial value
To progress in the understanding of the effects of these linguistic prosthesis, we have started a systematic and periodic modeling of two important functions. The fist function A(x) -> {s} models the association between a string x (a partial word or sentence) in the search engine query field and a set of string {s} (autocompletion). The second function V(s) evaluates the suggested bidding value for a given string s. We would like to measure whether the value V(s) is influencing the suggested set of strings given by the function A(x). Both functions may, of course, vary over time, and we have to measure A(x) and V(s) as time-dependent in order to document their relation at a given time t and, potentially, their evolution and possible correlation.
One obvious difficulty is the size of the space we need to monitor and the scale of all measurements: it is nearly impossible to test all the possible x entries at regular intervals over time. However, this scalability issue is very similar to a well-documented problem in the field of optimal experiment design (Fedorov 1972), addressed by artificial intelligence researchers for at least 20 years (Schmidhuber 1991). When a space is too big to explore in its entirety and when, in addition, each trial is costly – which is exactly our situation – one needs to choose smartly what query to test. In the context of their research in open-ended learning systems, Oudeyer and Kaplan have designed a optimal experiment design algorithm that performs precisely this task (details can be found in Oudeyer et al 2007, Kaplan and Oudeyer 2009): instead of trying random configuration, the algorithm detects situations in which its predictions progress maximally, and it then chooses the input signal in order to optimize its own progress. Following this principle, the algorithm running the measurements of the functions A(x) and V(s) avoids "uninteresting" subspaces in order to focus on the actions which are most likely to bring progress. Typically, it will focus its “attention” on subspaces of query strings with significant change in return value as measured by V(s). A daily script thus selects a set Q of n queries each day based on the optimal design algorithm. This produces a set S of results suggestions. For each s of S, we re-test the Value V(s).
Our ongoing experiment, focusing on the commodified lexicon Google-ese, derivated from the English language, is being conducted during one year. In that timeframe we hope to elaborate — at least on certain subspaces — a sufficiently good model of the two functions and their evolution over time to test various possible correlations between the two. These models will then be made public in form of a structured corpus, enabling long-time analysis and further studies by other research groups.
This preliminary study will not permit to assess whether or not — and if so: how — Google's linguistic economy is impacting natural languages. It will, however, allow us to make first educated guesses on the linguistic effects of autocompletion algorithms and keyword bidding. In a broader context, this research is an example of how academia can study technological "black boxes", such as search engines' algorithms, without accessing their inner workings. We believe that by their properties (i.e. enabling us to explore big, costly spaces) optimal experiment design algorithms are of great pertinence for this kind of "reverse engineering" modeling, and such research is likely to become of crucial societal relevance within the coming years.