Entity-Relationship Query (ERQ) is an entity-centric, structured query mechanism.
You query named entities by specifying what kinds of entities you want and what relations hold among them.
The current demo supports 0.75 million entities collected from Wikipedia. These entities are organized into 10 types,
namely Person, Company, University, City, Novel, Player, Club, Song, Film and Award. The 2008-07-24 snapshot of Wikipedia
is used as the corpus.
Four Types of Queries in ERQ
- SS: Single-Predicate Query (Selection Predicate)
- SJ: Single-Predicate Query (Join Predicate)
- MS: Multi-Predicate Query (without Join Predicate)
- MJ: Multi-Predicate Query (with Join Predicate)
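To make the four shapes concrete, the sketch below encodes one hypothetical query of each type as a plain Python structure. This is an illustration only, not the demo's actual query syntax; the entity types come from the demo's 10 types, but the variable names and keyword choices are invented.

    # Hypothetical encodings of the four ERQ query types (not the demo's
    # real syntax). A selection predicate constrains a single entity
    # variable with keywords; a join predicate relates two variables.

    # SS: one selection predicate on one entity variable
    ss = {"select": ["Novel x"],
          "where": [("x", 'by "Jane Austen"')]}

    # SJ: a single join predicate relating two entity variables
    sj = {"select": ["Person x", "University y"],
          "where": [("x", "y", "graduated from")]}

    # MS: several selection predicates, no join predicate
    ms = {"select": ["Film x"],
          "where": [("x", "directed by"), ("x", "won Award")]}

    # MJ: multiple predicates, at least one of them a join predicate
    mj = {"select": ["Player x", "Club y"],
          "where": [("x", "captain"), ("x", "y", "plays for")]}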
The INEX17 query set (converted from INEX topics) contains only SS and MS queries.
OWN28 contains SS, MS and MJ queries.
We did not design SJ queries for OWN28 because such queries are less common than the other three types.
A short guide on how to compose your queries can be found here.
Brief Comparison with Other Approaches/Systems
- The DB-based approach pre-extracts structured information from text into databases to support SQL queries on entities. Systems taking this approach (e.g. ExDB) are restricted by the capabilities of information extraction (IE) and natural language processing (NLP) technologies: facts that are not extracted are lost and thus cannot be queried in the database. ERQ circumvents this problem by querying directly over text rather than over extracted facts.
A detailed comparison between ERQ and the state-of-the-art IE system TextRunner is provided for the INEX17 query set.
- The Semantic Web approach encodes entity-relationship information, as well as general knowledge, in RDF format, and enables expressive SPARQL queries coupled with reasoning power, which ERQ lacks. However, RDF data are currently either manually collected or automatically extracted using IE/NLP; the former limits system scalability, while the latter suffers from the same problems as the DB-based approach.
- The IR-based approach is exemplified by the INEX Entity Ranking track. It focuses on finding entities according to narrative descriptions, which are often treated as term vectors, as in traditional IR systems. INEX queries do not embody the notion of a predicate and are thus unstructured. The newly formed Entity track in TREC is similar to INEX in this sense.
- ERQ takes the DB+IR approach.
On the one hand, ERQ supports SQL-like structured queries, consisting of multiple predicates.
On the other hand, the predicates are defined by keywords, as in IR queries.
We acknowledge that the effectiveness of our approach partially relies on the user's ability to provide proper keyword constraints, just as in IR queries.
Some related work (e.g., EntityRank) is similar to ERQ in that its queries are composed of keywords (like single-predicate queries in ERQ, not narrative descriptions as in INEX or TREC). However, those systems do not support multi-predicate queries.
Besides, they focus only on precision at the top few ranks, while ERQ attempts to maintain good precision over a longer range.
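As a rough sketch of the DB+IR idea (our simplification, not ERQ's actual retrieval or ranking model), the snippet below evaluates a multi-predicate query over raw sentences: an entity is an answer if, for every predicate, it co-occurs in at least one sentence with all of that predicate's keywords. The example sentences and entities are invented.

    # Simplified sketch: structured (multi-predicate) queries whose
    # predicates are keyword constraints checked against sentences.
    # Real systems use typed entity annotations and ranking; here each
    # item is a (recognized entity, sentence text) pair.

    sentences = [
        ("Pride and Prejudice", "Pride and Prejudice is a novel by Jane Austen."),
        ("Emma", "Emma, a novel written by Jane Austen, appeared in 1815."),
        ("Dracula", "Dracula is an 1897 novel by Bram Stoker."),
    ]

    def satisfies(entity, keywords):
        """True if the entity co-occurs with all keywords in some sentence."""
        return any(ent == entity and all(k.lower() in text.lower() for k in keywords)
                   for ent, text in sentences)

    def answers(predicates):
        """Entities satisfying every predicate (AND semantics)."""
        return [ent for ent in {e for e, _ in sentences}
                if all(satisfies(ent, kws) for kws in predicates)]

    # Two selection predicates on the same variable: novels by Jane Austen.
    print(answers([["novel"], ["by", "Jane Austen"]]))
    # Prints Pride and Prejudice and Emma (set order may vary); Dracula
    # fails the second predicate.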
How to Choose Keywords for Your Predicates?
To get the best possible results from ERQ, you should choose the keywords/phrases for your predicates wisely. The rule of thumb is to
- use words that are commonly used to express that kind of information, and
- use as few keywords as possible.
For example, if you are looking for novels written by Jane Austen, there are four options for specifying the predicate:
- Option 1: by Jane Austen
- Option 2: author Jane Austen
- Option 3: written by Jane Austen
- Option 4: Jane Austen
Option 1 is better than Option 2 because, when people mention a novel and its author, it is probably more common to say
- Pride and Prejudice, a novel written by Jane Austen, rather than
- Pride and Prejudice, whose author is Jane Austen
Since ERQ only recognizes explicitly stated facts (i.e., a novel co-occurring with the predicate keywords), Option 1 may capture more sentences and answers, and consequently yield better recall.
Option 3 is not as good as Option 1 because many sentences may simply say
- Pride and Prejudice, a novel by Jane Austen
without the word 'written'. Avoiding the extra 'written' serves the same purpose as above (capturing more sentences as evidence).
Option 4 is also recommended, for two reasons. First, it matches all sentences captured by the other three options, and thus has the best recall. Second, when a novel co-occurs with Jane Austen, the novel is most likely written by Jane Austen, so it makes sense to avoid the extra word 'by'.
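The sketch below makes this recall argument concrete with three invented evidence sentences. It counts, for each option, how many sentences contain all of that option's keywords (naive substring matching, which ERQ itself need not use), and reproduces the ordering argued above.

    # Invented evidence sentences mentioning a novel and its author.
    evidence = [
        "Pride and Prejudice, a novel written by Jane Austen",
        "Pride and Prejudice, a novel by Jane Austen",
        "Emma is Jane Austen's fourth published novel",
    ]

    options = {
        "Option 1": ["by", "Jane Austen"],
        "Option 2": ["author", "Jane Austen"],
        "Option 3": ["written", "by", "Jane Austen"],
        "Option 4": ["Jane Austen"],
    }

    for name, keywords in options.items():
        hits = sum(all(k.lower() in s.lower() for k in keywords) for s in evidence)
        print(f"{name} captures {hits} of {len(evidence)} sentences")
    # Option 4 captures all three sentences, Option 1 two, Option 3 one,
    # and Option 2 none: fewer, more common keywords mean more evidence.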
To try out an example query, please click the query link (the query will be auto-filled in the interface) and click GO.
This set contains 17 queries converted from topics in the INEX 2009 Entity Ranking track.
Single-Predicate Queries (Selection Predicate)
To help better understand the difference between ERQ and the DB-based approach, we compared ERQ with the state-of-the-art IE system, TextRunner.
For each query, we manually archived the true answers returned by ERQ and TextRunner for comparison.
It can be observed that, in many cases, TextRunner has lower recall. There are two possible reasons:
(1) some facts are not stated in its corpus at all; or
(2) the extraction phase fails to extract certain facts.
NOTE: This result does not mean ERQ is "better" than TextRunner. They are different approaches with different focuses: (1) TextRunner focuses on the extraction of relations themselves, and thus cannot answer queries about facts that were not extracted; we require no pre-extraction, but rely on the user to form appropriate query predicates. (2) We support multi-predicate queries and aim at good precision at relatively deep ranks rather than only the top few answers, which is not their focus. (3) The two systems use different corpora.
- TextRunner V1 is the most comprehensive demo, with facts extracted from 500 million high-quality Web pages.
- TextRunner V2 contains facts extracted only from Wikipedia. The exact Wikipedia snapshot used by this version is not known to us.
We used the keyword search interface provided by TextRunner and converted ERQ queries into TextRunner-friendly queries in order to get the maximum recall from it.
For example, when looking for novels by Neil Gaiman, we used by "Neil Gaiman" in the ERQ query predicate, but Neil Gaiman for TextRunner.
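The conversion itself was done by hand, but the rule of thumb can be sketched as follows (a hypothetical helper, not a tool we provide): strip the phrase quotes and drop function words such as 'by', keeping only the content keywords for TextRunner.

    # Hypothetical sketch of the manual conversion rule: remove phrase
    # quotes and drop function words so TextRunner's keyword search
    # matches as broadly as possible.
    def to_textrunner(erq_predicate):
        stop = {"by", "of", "in", "the", "a", "an"}
        words = erq_predicate.replace('"', "").split()
        return " ".join(w for w in words if w.lower() not in stop)

    print(to_textrunner('by "Neil Gaiman"'))  # -> Neil Gaiman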
Both versions of TextRunner were tried. Since Version 2 hardly returned any answers for any of the queries, we do not show its results here.
To try TextRunner V2 yourself, click Show advanced search options on its homepage and turn on the Wikipedia only option.
Multi-Predicate Queries (without Join Predicate)
These are multi-predicate queries without a join predicate.
Since the TextRunner demo does not support multiple predicates, there is no comparison with TextRunner for these queries.