Web Information Retrieval
Please visit links in this section and review the following search tips pages for Google Search, DuckDuckGo search, and Bing search.
- Refine web searches—Google Search Help. (n.d.). Retrieved August 3, 2022, from Google
- DuckDuckGo. (n.d.). DuckDuckGo Search Syntax. DuckDuckGo Help Pages. Retrieved August 3, 2022, from DuckDuckGo
- Advanced search options. (n.d.). Retrieved August 3, 2022, from Microsoft
Whether we want to search the web, a bibliographic database, or our Zotero library, it's helpful to know a bit about how information retrieval works in order to become good at searching.
In this lesson, we'll cover specific search techniques that you can use to improve, but first there are three principles that we need to consider as we advance our search skills.
First, we should understand that the basic information retrieval model centers on documents. Anything indexed in a database or on the web is treated as a document. Documents include text, sound, images, video, etc.
Second, we should understand that documents do not exist independently of other documents. Together they form a collection that we'll call the corpus. For the web, the corpus is organized like a file system, much like the file system on your personal computer. In a bibliographic database, the corpus is organized by predefined fields, such as author names, title names, subject terms, etc.
Third, our queries are not divorced from the documents nor the corpus nor the organization of the corpus. These things are all intertwined. Each time we search, search engines and databases compare documents in the corpus to each other and to how they are organized based on our query constructions, and then rank order (in some way) those documents by way of that comparison. Hence, when we construct queries, it's useful to think about the content (corpus) that we are searching.
To rehash, a search engine or database uses our queries, matches them against how they have indexed the corpus, and then rank orders the results based on the query and the corpus.
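To make the index, match, and rank steps concrete, here is a minimal sketch in Python. The three-document corpus and the scoring rule (count how often the query terms occur) are assumptions made for illustration only; real search engines use far more elaborate indexing and ranking.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus: document id -> text.
CORPUS = {
    "doc1": "flu symptoms and flu vaccines",
    "doc2": "forest fire season",
    "doc3": "flu season activity",
}

def build_index(corpus):
    """Map each term to a count of its occurrences in each document."""
    index = defaultdict(Counter)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Rank documents by how often they contain the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken alphabetically by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

INDEX = build_index(CORPUS)
print(search(INDEX, "flu season"))  # ['doc1', 'doc3', 'doc2']
```

Notice that the ranking depends on both the query and the corpus: add or remove a document, and the same query can produce a different order.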
In order to illustrate the above concepts, I'll primarily focus on searching the web in this section, but these techniques work across the databases that we use in a library.
The web is for storing and retrieving documents
When we use Google or another search engine to search, we are often looking for documents in a corpus that contain specific text that matches our query. This has some implications:
- Text has primacy, even for multimedia, which is often described using text.
- Our queries are matched against the text that appears in documents on the web or that describes documents on the web.
- The better our queries match the documents, the better (or more precise) our search results will be. (This assumes we can construct good queries.)
- The more ambiguous our queries are, the more work the search engine has to do to find relevant results.
- The challenge with search is that we do not always know what text a document contains even if that document covers the topic or concept that we think is relevant.
- For example, consider synonyms. We might want web pages that contain terms that are synonymous with our query term but do not actually contain our query term. But this can get complicated. If I search for star, could I also mean principal, lead, hero, celebrity, stellar, or sun?
- What if a document only uses terms like principal, lead, hero, celebrity, stellar, or sun? Might it still be useful if I were interested in documents (i.e., web pages) about star? Probably not, since some of those words, although synonymous with star, are not synonymous with each other (cf. principal and celebrity).
- Other word issues include things like homonyms, which are words that are pronounced or spelled the same but mean different things. Thus, by bark, do I mean the bark on a tree or the word we use to signal the sound a dog makes?
- Phrases are also important, with respect to term adjacency and word order. If I search for forest fire, search engines are more likely to return results where those two terms appear next to each other and in that order. This means that documents that contain text about someone having a campfire in a forest will be less likely to appear at the top of my results than a document that contains the phrase the forest is on fire. Documents where the terms are in order but spaced far apart will also rank lower. E.g., if the term forest appears in the first paragraph of a web page and the term fire appears in the last paragraph, this web page will rank lower than a page that contains the terms near or adjacent to each other.
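The adjacency and word-order idea in the last bullet can be sketched as a simple scoring rule. This is only an illustration under my own assumptions: the page texts are made up, words are split on whitespace without handling punctuation, and `min_ordered_gap` and `rank_by_adjacency` are hypothetical helper names, not anything a real search engine exposes.

```python
def min_ordered_gap(text, first, second):
    """Return the smallest number of words between `first` and a later
    `second`, or None if the terms never appear in that order."""
    words = text.lower().split()
    best = None
    last_first = None  # position of the most recent `first` seen so far
    for pos, word in enumerate(words):
        if word == first:
            last_first = pos
        elif word == second and last_first is not None:
            gap = pos - last_first - 1
            if best is None or gap < best:
                best = gap
    return best

def rank_by_adjacency(pages, first, second):
    """Order page names so that smaller gaps rank higher; pages where the
    terms never appear in order come last."""
    def sort_key(name):
        gap = min_ordered_gap(pages[name], first, second)
        return (gap is None, gap if gap is not None else 0, name)
    return sorted(pages, key=sort_key)

# Made-up page texts for illustration.
PAGES = {
    "adjacent": "the forest fire spread quickly",
    "near": "the forest is on fire",
    "reversed": "we lit a camp fire in the forest",
}
print(rank_by_adjacency(PAGES, "forest", "fire"))
# ['adjacent', 'near', 'reversed']
```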
The web is organized
I mentioned above that the web is like a file system, the kind that you'd find on your own computer. By this I mean that the web is organized, even though it may not appear so. If we know a bit about its organization, we can take good advantage of that when we search. For instance, we can narrow our searches to parts of the web. So the questions are: how is it organized? And how can we use that in search?
- The web is organized like a tree. This tree-like structure originally contained a few main branches, called top-level domains (TLDs). Example TLDs are .com, .edu, .org, .gov, .mil, and .net. All domains then branched off of those main branches.
- The tree has grown over the years and now contains about 1500 of these main branches (TLDs). Newer TLDs include .apple, .attorney, .camera, .green, .joy, .mobile, .office, .science, .space, and many more.
- Included in those are ccTLDs, or country code top-level domains. For example:
- .kr for South Korea
- .ae for the United Arab Emirates
- .fr for France
- .us for the United States
- Each of the big branches contains smaller branches, called second-level domains. For example:
- Under .com is google for google.com,
- Under .edu is uky for uky.edu,
- Under .org is wikipedia for wikipedia.org,
- Under .gov is usa for usa.gov,
- Under .apple is newsroom for newsroom.apple,
- and so on.
- Those branches (second-level domains) contain even smaller branches that are called third-level domains or subdomains. Examples include:
- maps for maps.google.com
- calendar for calendar.uky.edu
- en for en.wikipedia.org
- analytics for analytics.usa.gov
- www for www.uky.edu
- ci for ci.uky.edu
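To see the tree structure in a hostname, we can read its labels from right to left. The sketch below is a simplification under a stated assumption: it treats the last label as the TLD and the one before it as the second-level domain. That holds for the examples above but not for names like google.com.mx, where registrations happen under .com.mx; real software consults the Public Suffix List for those cases. The function name `split_hostname` is my own.

```python
def split_hostname(hostname):
    """Split a hostname into TLD, second-level domain, and subdomains.

    Assumes a single-label TLD (e.g. com, edu, gov), which is a simplification.
    """
    labels = hostname.lower().strip(".").split(".")
    if len(labels) < 2:
        raise ValueError(f"expected at least two labels, got {hostname!r}")
    return {
        "tld": labels[-1],
        "second_level": labels[-2],
        "subdomains": labels[:-2],  # empty list when there is no subdomain
    }

for name in ["maps.google.com", "en.wikipedia.org", "newsroom.apple"]:
    print(name, split_hostname(name))
```

Reading the labels this way is what lets us choose how much of the tree a site search should cover: the whole TLD, one second-level domain, or a single subdomain.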
We can take advantage of this organization by limiting (or focusing) queries to results within smaller sections of the web. In Google, this would entail using what is called the site: operator. We use this in combination with our search queries. For example, let's say I do a search for the term flu. I notice that most of the results that I'm interested in are from .gov sites and most of the results I am less interested in are not. To focus on the gov domains, I add the site operator to my query. This is the Google query I could use to search for the topic flu only on .gov sites:
flu site:gov
Then perhaps I find these results still too general. For example, let's say I live in Kentucky but Google keeps showing me .gov sites from other states. We can focus on just a smaller branch of the tree. E.g., if I wanted to focus only on results from Kentucky, then I could use:
flu site:ky.gov
Or I can specify a part of Kentucky's government, like the Cabinet for Health and Family Services or the Kentucky Department of Education with these queries:
flu site:chfs.ky.gov
flu site:education.ky.gov
Constructing queries for precise results
There are all sorts of other tips and tricks we can apply to revise and make our queries more precise.
Use quotes around our search terms
With a search like flu site:gov, Google provides us with snippets of text that highlight where the term flu appears in the web pages that are retrieved. For example, we see terms like these in the search results:
- flu symptoms
- flu activity,
- flu vaccines, and
- flu season.
This tells us important information about how Google sees the text on web pages, and we can use this information to revise our search. For example, let's say I'm interested in web pages that contain info about flu vaccines and less interested in pages that contain information on flu activity or flu season. If that's the case, then I can add the additional term to my query and enclose the whole query in quotation marks. That will force Google to rank pages with the literal term "flu vaccines" much higher than pages with those other terms or phrases, or exclude those other pages altogether. So our query will now look like this (note: only the query terms, and not the site operator, are quoted):
"flu vaccines" site:gov
Get more recent pages
If I'm really interested only in recent pages, I can click on the Tools button and select Any time, Past hour, Past 24 hours, Past week, and so on, to limit results to certain time periods.
Exclude results with the minus sign
Let's take a look at our flu vaccine search. Instead of enclosing "flu vaccines" in quotes to return only pages with that phrase and to reduce pages retrieved with other phrases, I could exclude the other phrases altogether (i.e., activity and season) by excluding them with a minus sign. To exclude the terms activity and season from our search results, this is how our search would look:
flu -activity -season site:gov
I can also exclude specific domains or specific websites:
flu vaccines -site:com
flu vaccines -site:webmd.com
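If you find yourself assembling these queries often, the pieces covered so far (quoted phrases, the minus sign, and the site operator) can be combined with a small helper. This is a sketch for illustration: `build_query` and its parameter names are my own invention, and it only produces the query string that you would then paste into a search box.

```python
def build_query(terms=(), phrase=None, exclude=(), site=None, exclude_sites=()):
    """Assemble a search query string from its parts."""
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')  # quotes ask for an exact phrase
    parts.extend(f"-{term}" for term in exclude)  # minus sign excludes a term
    if site:
        parts.append(f"site:{site}")  # limit results to one branch of the web
    parts.extend(f"-site:{name}" for name in exclude_sites)
    return " ".join(parts)

print(build_query(phrase="flu vaccines", site="gov"))
# "flu vaccines" site:gov
print(build_query(["flu"], exclude=["activity", "season"], site="gov"))
# flu -activity -season site:gov
print(build_query(["flu", "vaccines"], exclude_sites=["webmd.com"]))
# flu vaccines -site:webmd.com
```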
Term order matters
Results will be different depending on the order of the query terms. Google has gotten good over the years at handling natural language (how we talk in real life), and so the suggestion is to use natural language in your query. For example, it's generally better to use the search terms flu vaccine instead of vaccine flu, since the former is how we'd phrase the terms in English. This will of course vary by language. In many Romance languages, but also others, it's common (but varies) for the modifier to come after the word being modified. For example, in Spanish, we would say shirt red:
camisa roja
Thus, a Spanish speaker would want to search for camisa roja and not roja camisa. This of course is regardless of the country of origin, but note that Google has separate landing pages for Google.com depending on the country you're located in. For example, for those residing in Mexico, google.com directs to google.com.mx, where mx is the ccTLD for Mexico: Google Search - Mexico. For those residing in Canada, it's google.ca.
Although term order can determine meaning or reflect natural language, we can pair some terms together as lists, without any impact on meaning or natural language. How we pair them might change the results retrieved, though, which becomes noticeable as we scan the search results. For example, consider the following two searches:
google bing
bing google
The above two search queries are semantically equivalent (they mean the same thing) and their order is arbitrary in our list, but search engines implicitly place a priority on term order. The first term in the query is prioritized over the second term. So if you search Google using the above two queries, the first page of results might be mostly the same, but as you page through them, you might see more pages on Google than on Bing, for the first search, and vice versa for the second search.
One OR the other OR both
When we search using multiple terms, we can use the OR operator to tell Google to return pages with either of the terms or both of the terms. Consider the following two searches:
"google" "bing"
google OR bing
The first search will return pages with both the terms included in the results because the quotes enforce that.
The second search will return pages with either the term google in the results, the term bing in the results, or both the terms in the results. Note also, based on my personal experience, that if I test the second search in Google Search and also in Bing Search, then Google will return more results about Google, and Bing will return more results about Bing, respectively.
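The difference between requiring both terms and accepting either one is the difference between a set intersection and a set union. Here is a small sketch of that logic; the four page texts are invented for illustration, and this says nothing about how Google or Bing rank the pages they return.

```python
# Made-up pages: name -> text.
PAGES = {
    "page1": "google announces a new search feature",
    "page2": "bing adds image search",
    "page3": "comparing google and bing search results",
    "page4": "duckduckgo search syntax",
}

def pages_with(term):
    """Return the set of page names whose text contains the term."""
    return {name for name, text in PAGES.items() if term in text.lower().split()}

# "google" "bing": both terms are required (intersection).
both = pages_with("google") & pages_with("bing")

# google OR bing: either term, or both (union).
either = pages_with("google") | pages_with("bing")

print(sorted(both))    # ['page3']
print(sorted(either))  # ['page1', 'page2', 'page3']
```

The OR search always returns at least as many pages as the search that requires both terms, which is why OR is useful for broadening a search and quoting is useful for narrowing one.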
There are other operators we can use in search engines, and many of them work regardless of which search engine we use. Here are some examples that you can test in Google or elsewhere:
- related: to find related sites
- filetype: to return results in specific types of files:
- search uky.edu for the term flu vaccine but only retrieve PDFs:
flu vaccines filetype:pdf site:uky.edu
- same as above but only return Microsoft Word files:
flu vaccines filetype:docx site:uky.edu
- same idea as above but only return Microsoft Excel files:
birth weight filetype:xlsx site:gov
Information retrieval (or search) can be complex but fun. Remember the three principles we started with in this section, and apply those principles when constructing your queries.
- document centered
- consider the text
- no document is an island
- consider the document with respect to the corpus
- the web is organized
- take advantage of how the web is structured with site searches
Remember and practice the techniques I discussed here:
- query construction
- use quotes to force exact matches
- exclude terms with the minus sign
- term order matters
- use OR to select alternate terms
If you forget anything, use advanced search: https://www.google.com/advanced_search
Or Advanced Image Search: https://google.com/advanced_image_search
Google provides a list of some of these operators.
Other search engines also have search operators, and often they're the same.
You can get very advanced with your queries. Here are some examples:
trade ("surplus" OR "deficit") (site:whitehouse.gov OR site:congress.gov)
Or, limit to specific filetypes:
trade ("surplus" OR "deficit") (site:whitehouse.gov OR site:congress.gov) filetype:pdf
The last search query is complex enough that it is roughly equivalent to running the following separate queries at the same time and combining the results:
trade surplus site:whitehouse.gov filetype:pdf
trade surplus site:congress.gov filetype:pdf
trade deficit site:whitehouse.gov filetype:pdf
trade deficit site:congress.gov filetype:pdf
trade surplus site:congress.gov site:whitehouse.gov filetype:pdf
trade deficit site:congress.gov site:whitehouse.gov filetype:pdf
trade surplus deficit site:congress.gov site:whitehouse.gov filetype:pdf
trade surplus deficit site:congress.gov filetype:pdf
trade surplus deficit site:whitehouse.gov filetype:pdf
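The first four queries in that list are what we get by picking exactly one option from each OR group. That expansion can be sketched with a Cartesian product. Note the assumptions: `expand_or_groups` is a hypothetical helper of my own, and this illustrates the Boolean logic only, not how a search engine evaluates the query internally.

```python
from itertools import product

def expand_or_groups(fixed_terms, or_groups, suffix=""):
    """List every query formed by choosing one option from each OR group."""
    queries = []
    for combo in product(*or_groups):
        parts = [fixed_terms, *combo]
        if suffix:
            parts.append(suffix)
        queries.append(" ".join(parts))
    return queries

queries = expand_or_groups(
    "trade",
    [
        ["surplus", "deficit"],
        ["site:whitehouse.gov", "site:congress.gov"],
    ],
    suffix="filetype:pdf",
)
for query in queries:
    print(query)
# trade surplus site:whitehouse.gov filetype:pdf
# trade surplus site:congress.gov filetype:pdf
# trade deficit site:whitehouse.gov filetype:pdf
# trade deficit site:congress.gov filetype:pdf
```

Two groups of two options give four combinations, and each additional OR group multiplies that count, which is why grouping with parentheses saves so much typing.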