Search Help

A number of search options are available for this database application. Simple searches are possible, but quite advanced searches can also be performed. Search is based on the open-source search library Apache Lucene and tries to bring into play all the options that Lucene provides.

Search Options

The search interface offers a number of choices.

You can choose which texts to search in by selecting the checkboxes to the left of the titles. Note that if you do not make any choice, you will search in all available texts – there is no need to click all check boxes in order to perform a general search.

You have a choice among search modes. These are explained in more detail below, but the most common ones are "Any Search Term" and "All Search Terms". The first is the default, so if you do not make any choice, you will automatically search for any search term. This means that you get a hit whenever at least one of the words you enter is present, whether or not the others are. If you search for all search terms, you only get a hit if all of the words you enter are present.

Search is performed in a number of fields - to be more exact, in indexes based on certain fields. This contrasts with searches in a word processing document which are made in a document as a whole, as one long string of characters.

You have the option to perform your search in different work types – instead of selecting works individually, you can select groups such as tragedies or poems. Here you can choose several different options.

This may seem obvious, but you search for words: spaces, punctuation and so on cannot be searched for, only the words themselves. All your input is stripped of punctuation and lower-cased, so you can spare yourself the effort of typing punctuation or capital letters.
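The normalisation described above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize` and the pattern `\w+` are assumptions, and the analyzer actually used by the application may treat apostrophes, hyphens or digits differently.

```python
import re

def normalize(query: str) -> list[str]:
    """Lower-case the input and return its words, dropping spaces and punctuation.

    Hypothetical helper for illustration; not the application's real analyzer.
    """
    return re.findall(r"\w+", query.lower())

print(normalize("To be, or NOT to be!"))
# ['to', 'be', 'or', 'not', 'to', 'be']
```

After this step only a list of lower-case words remains, which is why typing punctuation or capitals makes no difference in a simple search.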

One may ask: if spaces cannot be searched for, what are phrase searches? Don't phrases consist of words with spaces and punctuation in between? Yes, but when you search for a phrase, this is not the same as when you search for a phrase in a word processing document – with Lucene, you actually search for a sequence of words which have no words in between them, so everything is about words after all (and a phrase search is actually a proximity search – more about that later).

Search Strategies

There are searches – and then there are searches.

  • One can search by simply filling in words in the query field, choosing one of the search mode options ("Any Search Term" …) and pressing return (or clicking the magnifying glass).
  • One can use the "standard" Lucene search options. These consist of marking which words you would like to occur in the hit list and which words must (or must not) occur, stringing together words with AND, OR and NOT, grouping words with parentheses, and so on. Quite complicated searches can be made with these options.
  • One can use regular expressions. These offer options for wildcarding characters and so on. Regular expression searches can be used on top of Lucene searches.

These three different search strategies will be presented below, after a brief mention about the way hits are displayed.

Hitlist and Search Relevance

When searching for books in a library catalogue, one can usually choose to have the hits displayed according to relevance or according to author, title or suchlike. In the Shakespeare app, hits are only displayed according to relevance, that is, according to a "score" computed for each hit. The computation is quite complicated in itself, but basically, the more often your search terms occur in a hit and the less common they are in the index as a whole, the higher the score of the hit and the more prominently it is displayed.

Simple Search

If you select "Any Search Term" and fill in some words in the query field, you are saying that you would like to see as many of the words as possible within the search scope, but that a passage containing only one of them should also be displayed as a hit.

If you select "All Search Terms" and fill in some words in the query field, you are saying that you want to see all of the words in the hits within the search scope – if just one of the words is missing, you do not want to have it displayed as a hit.

If you select "Phrase Search" and fill in some words in the query field, you are saying that you want to see all of the words in the hits within the search scope, but only if they occur in the same sequence. This is the way searches are performed in word processing documents, except that here punctuation is disregarded.
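The difference between these three modes can be illustrated on a passage that has already been reduced to a list of lower-case words. The function names below are hypothetical, and the sketch only decides whether a passage is a hit; it does not compute a score.

```python
def matches_any(terms, tokens):
    """'Any Search Term': at least one query word is present."""
    return any(term in tokens for term in terms)

def matches_all(terms, tokens):
    """'All Search Terms': every query word is present, in any order."""
    return all(term in tokens for term in terms)

def matches_phrase(terms, tokens):
    """'Phrase Search': the query words occur next to each other, in order."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))

passage = "the quality of mercy is not strained".split()

print(matches_any(["mercy", "snake"], passage))             # True
print(matches_all(["mercy", "snake"], passage))             # False
print(matches_phrase(["quality", "of", "mercy"], passage))  # True
print(matches_phrase(["mercy", "quality"], passage))        # False
```

Because the passage is compared word by word, punctuation in the original text plays no part in a phrase match.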

If you select one of the two "Proximity Search" options and fill in some words in the query field, you are saying that you want to see all of the words in the hits within the search scope, in the order specified or not, and within a certain proximity. The proximity is stated as the maximum number of words allowed in between the words you enter in your query. For instance, "probability character 6" will return the records in which those two words appear within 6 words of one another. If you do not enter any digit, 5 will be assumed.
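As a rough sketch of the proximity rule for two words, assuming that the distance is counted exactly as described above (the number of words standing between the two query words): the function name and parameters are hypothetical, and Lucene's own distance calculation may differ in detail.

```python
def within_proximity(first, second, tokens, max_between=5, ordered=False):
    """Return True if both words occur with at most `max_between` words between them.

    With ordered=True, `first` must come before `second`.
    """
    positions_first = [i for i, t in enumerate(tokens) if t == first]
    positions_second = [i for i, t in enumerate(tokens) if t == second]
    for i in positions_first:
        for j in positions_second:
            if i == j:
                continue  # one token cannot stand for both query words
            if ordered and j < i:
                continue  # wrong order for an ordered search
            if abs(j - i) - 1 <= max_between:
                return True
    return False

tokens = "the probability that a man of such character would lie".split()

# Five words stand between "probability" and "character".
print(within_proximity("probability", "character", tokens))                 # True
print(within_proximity("probability", "character", tokens, max_between=4))  # False
print(within_proximity("character", "probability", tokens, ordered=True))   # False
```

The default of 5 mirrors the value that is assumed when no digit is entered.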

"Fuzzy Search" needs a little explanation. If you take a word, like "probability", you can make changes and additions to it. One change would thus give you "spake", "slave", "snare" and "probabilitys". If you make one more change based on this, you can easily see that a lot of words can be generated. Since this search is very time-consuming, the maximum number of "edits" you can make is 2. If you do not enter any digit, 2 is also assumed. Fuzzy search demands so many resources that only one term can be searched at a time, so all words after the first will be removed from your query.

"Wildcard Search" offers the possibility to search using ? for a single character and * for zero, one or more characters. You would retrieve hits with "shake", "spake", "stake" and so on with "s?ake", and "probability" and "probabilitys" with "probability*". "*ling" will give you "telling", "trembling", "brawling" and so on, "te??ing" will give you "telling", "teeming", "tending" and so on. This offers some of the functionality of a regular expression search, but be aware that the symbols ? and * have different meanings in wildcard and regex search.

Standard Lucene Syntax

With Lucene standard syntax, there are two ways you can go: you can either prefix words with + and - or use boolean logic with AND, OR and NOT (written in upper-case). In both cases, you can additionally group your search expressions using parentheses. In case you use any of these operators (or any operators used in regex searches), the search mode will automatically be set to Any Search Term (or to Regex Search if this applies), so choosing any of the other options has no effect.

The first option (using + and -) is better suited to a search which orders hits according to score. Here you let words stand as they are (without + or -) if you would like them to occur in hits, but you prefix them with + if they must occur in a hit and with - if they must not occur in a hit. If you search for "probability character" you get a lot of hits with either "probability" or "character" and some with both. If you search for "probability +character", all your hits will contain "character", but they may or may not contain "probability". If you search for "probability -character", you would like to see hits with "probability", but only if they do not contain "character".
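The three roles a word can play (would like, must, must not) can be sketched as follows. The function names are hypothetical, and the sketch only decides whether a passage counts as a hit; in the real search, the unprefixed words additionally raise the score of the hits that contain them.

```python
def parse_prefixed(query):
    """Split a query into 'must' (+word), 'should' (word) and 'must not' (-word) sets."""
    must, should, must_not = set(), set(), set()
    for word in query.split():
        if word.startswith("+"):
            must.add(word[1:])
        elif word.startswith("-"):
            must_not.add(word[1:])
        else:
            should.add(word)
    return must, should, must_not

def is_hit(query, tokens):
    """Decide whether a passage (given as a list of words) is a hit for the query."""
    must, should, must_not = parse_prefixed(query)
    present = set(tokens)
    if must_not & present:
        return False            # a forbidden word occurs
    if must:
        return must <= present  # all required words occur; optional words only affect score
    return bool(should & present)

only_probability = ["probability"]
only_character = ["character"]
both = ["probability", "character"]

for query in ("probability character", "probability +character", "probability -character"):
    print(query, [is_hit(query, p) for p in (only_probability, only_character, both)])
# probability character [True, True, True]
# probability +character [False, True, True]
# probability -character [True, False, False]
```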

If you use AND, OR and NOT, the logic is rather different. If you search for "probability AND character" you get hits with both "probability" and "character" and none with only one of them. This corresponds to "+probability +character". If you search for "probability OR character", this is the same as simply searching for "probability character". If you search for "probability NOT character", this equals searching for "probability -character".

Searches can acquire higher complexity through the use of parentheses. Here the use of AND, OR and NOT may come more naturally. Say you want to find passages where the word "character" occurs but where at least one of the words "probability", "cause" or "reason" also occurs. You can express this by "(probability OR cause OR reason) AND character". An AND enforces "must occur" on both sides, so both one of the three alternatives and the word "character" have to occur in the hits. Say (for some reason) you do not wish the words "trial" and "elevate" to occur in your hits – you then extend your search expression with "NOT (trial OR elevate)", giving "(probability OR cause OR reason) AND character NOT (trial OR elevate)".

If you simply search for "trial OR cause AND character", you will (because the AND rubs off to the left) search for passages where "cause" and "character" must occur, but you would also like "trial" to be marked as a hit. You can enforce a certain logic on your query by grouping with parentheses.

If you search for "(probability OR cause) AND character" you are saying that one or both of "probability" and "cause" must occur, as must "character".

If you search for "probability OR (cause AND character)", you would like to retrieve hits where "probability" occurs and you would like to retrieve hits where "cause" and "character" go together. In practice this means that you will get a lot of "probability"-only hits.

You can also nest parentheses, e.g. "(probability OR (cause AND character)) NOT trial" will remove the hits with "trial" from "probability OR (cause AND character)".
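One way to picture grouped queries is to think of each word as the set of passages in which it occurs: AND then corresponds to intersection, OR to union and NOT to set difference. The passage numbers below are invented for illustration, and the sketch ignores scoring altogether.

```python
# Hypothetical index: word -> numbers of the passages containing it.
index = {
    "probability": {1, 2},
    "cause": {2, 3, 4},
    "character": {3, 4, 5},
    "trial": {1, 4},
}

# (probability OR cause) AND character
grouped_left = (index["probability"] | index["cause"]) & index["character"]
print(sorted(grouped_left))   # [3, 4]

# probability OR (cause AND character)
grouped_right = index["probability"] | (index["cause"] & index["character"])
print(sorted(grouped_right))  # [1, 2, 3, 4]

# (probability OR (cause AND character)) NOT trial
nested = grouped_right - index["trial"]
print(sorted(nested))         # [2, 3]
```

Moving the parentheses changes the result from passages 3 and 4 to passages 1 to 4, which is why grouping matters as soon as AND and OR are mixed.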

As you can see, the options are many … And as if this were not enough, there is also regex – and regex syntax combined with standard syntax!

Regular Expressions

Regular expressions are also known as "regex" or "regexp". They are a very powerful tool for searching text (and for replacing text, but this is not relevant in a search engine). Lucene supports only a limited range of regex operators, but these should be enough for most uses.

If you use any of the regex operators (. ? + * | { } [ ] ( ) " \ # @ & < > ~), the search will automatically switch to regex mode. Note that some of the operators are the same as those used in standard Lucene syntax, but they occur in different positions in relation to the words/character strings they operate on.

Match any character

The period "." can be used to represent any character.

In order to retrieve the string "snake", either of the following expressions can be used:

  • s.ake
  • .nak.

One-or-more

The plus sign "+" can be used to repeat the preceding shortest pattern one or more times.

In order to retrieve the string "deer", the following expression can be used:

  • de+r

Zero-or-more

The asterisk "*" can be used to match the preceding shortest pattern zero-or-more times.

In order to retrieve the strings "weed" and "wed", the following expression can be used:

  • we*d

Note that in Lucene standard syntax, "+" marks a word that must occur and "*" serves as a wildcard standing in for characters; here they quantify the immediately preceding character (or pattern).

Zero-or-one

The question mark "?" makes the preceding shortest pattern optional. It matches zero or one times.

In order to retrieve the strings "weed" and "wed", the following expression can be used:

  • wee?d
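The operators introduced so far (".", "+", "*" and "?") can be tried out with Python's `re` module. This is only an approximation: Python's regex dialect is not Lucene's, but these four operators behave the same way in both, and `re.fullmatch` is used because a Lucene regex has to match a whole indexed word rather than part of one. The helper name `matches_term` is an assumption.

```python
import re

def matches_term(pattern: str, term: str) -> bool:
    """True if the pattern matches the whole term, as an index term would be matched."""
    return re.fullmatch(pattern, term) is not None

# Any character
print(matches_term("s.ake", "snake"), matches_term(".nak.", "snake"))  # True True

# One-or-more: at least one "e" is required
print([t for t in ("dr", "der", "deer") if matches_term("de+r", t)])   # ['der', 'deer']

# Zero-or-more: the "e" may be missing altogether
print([t for t in ("wd", "wed", "weed") if matches_term("we*d", t)])   # ['wd', 'wed', 'weed']

# Zero-or-one: only the second "e" is optional
print([t for t in ("wed", "weed", "weeed") if matches_term("wee?d", t)])  # ['wed', 'weed']
```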

Min-to-max

Curly brackets "{}" can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5} repeat exactly 5 times
{2,5} repeat at least twice and at most 5 times
{2,} repeat at least twice

In order to retrieve the string "weed", any of the following expressions can be used:

  • we{2}d
  • we{2,}d
  • we{2,5}d

Grouping

Parentheses "()" can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group.

In order to retrieve the string "weed", any of the following expressions can be used:

  • w(..)+d
  • w(ee)*d
  • w(ee)?d

Alternation

The pipe symbol "|" acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest.

In order to retrieve the strings "proportions" and "preparations", the following expression can be used:

  • (prepara|propor)tions
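Curly brackets, grouping and alternation can be checked in the same way with Python's `re` module, again as an approximation of Lucene's regex dialect, which agrees with Python on these operators. The helper `matches_term` is a hypothetical name; it requires the pattern to cover the whole word. The last two lines show what "applies to the longest pattern" means when the parentheses are left out.

```python
import re

def matches_term(pattern: str, term: str) -> bool:
    """True if the pattern matches the whole term."""
    return re.fullmatch(pattern, term) is not None

# Min-to-max: exactly two, at least two, two to five "e"s
print(matches_term("we{2}d", "weed"), matches_term("we{2}d", "wed"))            # True False
print(matches_term("we{2,}d", "weeeed"), matches_term("we{2,5}d", "weeeeeed"))  # True False

# Grouping: the quantifier applies to the whole group
print(matches_term("w(..)+d", "weed"), matches_term("w(..)+d", "wed"))  # True False
print(matches_term("w(ee)*d", "wd"), matches_term("w(ee)?d", "weed"))   # True True

# Alternation inside a group: either prefix, followed by "tions"
print([t for t in ("proportions", "preparations")
       if matches_term("(prepara|propor)tions", t)])  # ['proportions', 'preparations']

# Without the group, the alternatives are "prepara" and "proportions"
print(matches_term("prepara|proportions", "preparations"))  # False
print(matches_term("prepara|proportions", "prepara"))       # True
```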

Character classes

Character classes are very important, since they allow you to mask variation with more control than that offered by wildcards. You can thus use them to find words even though they are spelled differently, e.g. have either "e" or "o" in a certain position, or either "a" or "e" in a certain position.

Ranges of potential characters may be represented as character classes by enclosing them in square brackets "[]". A leading ^ negates the character class, that is, all characters other than the ones following are signified.

The allowed forms are:

[abc] 'a' or 'b' or 'c'
[a-c] 'a' or 'b' or 'c'
[-abc] '-' or 'a' or 'b' or 'c'
[abc\-] '-' or 'a' or 'b' or 'c'
[^abc] any character except 'a' or 'b' or 'c'
[^a-c] any character except 'a' or 'b' or 'c'
[^-abc] any character except '-' or 'a' or 'b' or 'c'
[^abc\-] any character except '-' or 'a' or 'b' or 'c'

Note that the dash "-" indicates a range of characters unless it is the first character (after any leading ^) or is escaped with a backslash.

In order to retrieve the string "weed", any of the following expressions can be used:

  • w[uiaeo]+d
  • w[uiaeo]*d
  • we[uiaeo]?d
  • w[a-u]*ed
  • we[^o]d
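The character-class forms listed above also exist in Python's `re` module, so they can be tried out there; as before, this approximates Lucene's behaviour rather than reproducing it. The helper `matches_term` and the word list are assumptions made for the example.

```python
import re

def matches_term(pattern: str, term: str) -> bool:
    """True if the pattern matches the whole term."""
    return re.fullmatch(pattern, term) is not None

words = ["weed", "wood", "wed", "wd", "weod"]

# One or more vowels between "w" and "d"
print([w for w in words if matches_term("w[uiaeo]+d", w)])  # ['weed', 'wood', 'wed', 'weod']

# "we", then any single character except "o", then "d"
print([w for w in words if matches_term("we[^o]d", w)])     # ['weed']

# A dash is literal when it comes first or is escaped, and a range marker otherwise
print(matches_term("[-abc]", "-"), matches_term(r"[abc\-]", "-"))  # True True
print(matches_term("[a-c]", "b"), matches_term("[a-c]", "-"))      # True False
```

Compared with the wildcard "w*d", the class "[uiaeo]" restricts the variable positions to vowels, which is the kind of extra control that character classes offer.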

The possibilities here are enormous.