Entity-Relationship Query (ERQ) is an entity-centric, structured query mechanism.
You query named entities by specifying what kinds of entities you want and what relations hold among them.
The current demo supports 0.75 million entities collected from Wikipedia. These entities are organized into 10 types,
namely Person, Company, University, City, Novel, Player, Club, Song, Film and Award. The 2008-07-24 snapshot of Wikipedia
is used as the corpus.
Four Types of Queries in ERQ
- SS: Single-Predicate Query (Selection Predicate)
- SJ: Single-Predicate Query (Join Predicate)
- MS: Multi-Predicate Query (without Join Predicate)
- MJ: Multi-Predicate Query (with Join Predicate)
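To make the four shapes concrete, the sketch below encodes one hypothetical query of each type as a plain Python structure. This is an illustration only, not the demo's actual query syntax; the entity types come from the demo's 10 types, but the variable names and keyword choices are invented.

    # Hypothetical encodings of the four ERQ query types (not the demo's
    # real syntax). A selection predicate constrains a single entity
    # variable with keywords; a join predicate relates two variables.

    # SS: one selection predicate on one entity variable
    ss = {"select": ["Novel x"],
          "where": [("x", 'by "Jane Austen"')]}

    # SJ: a single join predicate relating two entity variables
    sj = {"select": ["Person x", "University y"],
          "where": [("x", "y", "graduated from")]}

    # MS: several selection predicates, no join predicate
    ms = {"select": ["Film x"],
          "where": [("x", "directed by"), ("x", "won Award")]}

    # MJ: multiple predicates, at least one of them a join predicate
    mj = {"select": ["Player x", "Club y"],
          "where": [("x", "captain"), ("x", "y", "plays for")]}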
The INEX17 query set (converted from INEX topics) contains only SS and MS queries.
OWN28 contains SS, MS and MJ queries.
We did not design SJ queries for OWN28 because such queries are less common than the other three types.
A short guide on how to compose your queries can be found here.
Brief Comparison with Other Approaches/Systems
- The DB-based approach pre-extracts structured information from text into databases to support SQL queries on entities. Systems taking this approach (e.g. ExDB) are restricted by the capabilities of information extraction (IE) and natural language processing (NLP) technologies: facts that are not extracted are lost and thus cannot be queried in the database. ERQ circumvents this problem by querying directly over text rather than over extracted facts.
A detailed comparison between ERQ and the state-of-the-art IE system TextRunner is provided for the INEX17 query set.
- The Semantic Web approach encodes entity-relationship information, as well as general knowledge, in RDF format, and enables expressive SPARQL queries coupled with reasoning power, which ERQ lacks. However, RDF data are currently either manually collected or automatically extracted using IE/NLP; the former limits system scalability, while the latter suffers from the same problems as the DB-based approach.
- The IR-based approach is exemplified by the INEX Entity Ranking track. It focuses on finding entities according to narrative descriptions, which are often treated as term vectors, as in traditional IR systems. INEX queries do not embody the notion of a predicate and are thus unstructured. The newly formed Entity track in TREC is similar to INEX in this sense.
- ERQ takes the DB+IR approach.
On the one hand, ERQ supports SQL-like structured queries, consisting of multiple predicates.
On the other hand, the predicates are defined by keywords, as in IR queries.
We acknowledge that the effectiveness of our approach partially relies on the user's ability to provide proper keyword constraints, just as in IR queries.
Some related work (e.g., EntityRank) is similar to ERQ in that its queries are composed of keywords (like single-predicate queries in ERQ, not narrative descriptions as in INEX or TREC). However, those systems do not support multi-predicate queries.
Besides, they focus only on precision at the top few ranks, while ERQ attempts to maintain good precision over a longer range.
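As a rough sketch of the DB+IR idea (our simplification, not ERQ's actual retrieval or ranking model), the snippet below evaluates a multi-predicate query over raw sentences: an entity is an answer if, for every predicate, it co-occurs in at least one sentence with all of that predicate's keywords. The example sentences and entities are invented.

    # Simplified sketch: structured (multi-predicate) queries whose
    # predicates are keyword constraints checked against sentences.
    # Real systems use typed entity annotations and ranking; here each
    # item is a (recognized entity, sentence text) pair.

    sentences = [
        ("Pride and Prejudice", "Pride and Prejudice is a novel by Jane Austen."),
        ("Emma", "Emma, a novel written by Jane Austen, appeared in 1815."),
        ("Dracula", "Dracula is an 1897 novel by Bram Stoker."),
    ]

    def satisfies(entity, keywords):
        """True if the entity co-occurs with all keywords in some sentence."""
        return any(ent == entity and all(k.lower() in text.lower() for k in keywords)
                   for ent, text in sentences)

    def answers(predicates):
        """Entities satisfying every predicate (AND semantics)."""
        return [ent for ent in {e for e, _ in sentences}
                if all(satisfies(ent, kws) for kws in predicates)]

    # Two selection predicates on the same variable: novels by Jane Austen.
    print(answers([["novel"], ["by", "Jane Austen"]]))
    # Prints Pride and Prejudice and Emma (set order may vary); Dracula
    # fails the second predicate.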
How to Choose Keywords for Your Predicates?
To get the best possible results from ERQ, you should choose the keywords/phrases for your predicates wisely. The rule of thumb is to
- use words that are commonly used to express that kind of information, and
- use as few keywords as possible.
For example, if you are looking for novels written by Jane Austen, there are four options for specifying the predicate:
- Option 1: by Jane Austen
- Option 2: author Jane Austen
- Option 3: written by Jane Austen
- Option 4: Jane Austen
Option 1 is better than Option 2 because, when people mention a novel and its author, it is probably more common to say
- Pride and Prejudice, a novel written by Jane Austen, rather than
- Pride and Prejudice, whose author is Jane Austen
Since ERQ only recognizes explicitly stated facts (i.e., a novel co-occurring with the predicate keywords), Option 1 may capture more sentences and answers, and consequently yield better recall.
Option 3 is not as good as Option 1 because many sentences may simply say
- Pride and Prejudice, a novel by Jane Austen
without the word 'written'. Avoiding the extra 'written' serves the same purpose as above (capturing more sentences as evidence).
Option 4 is also recommended, for two reasons. First, it matches all sentences captured by the other three options, and thus has the best recall. Second, when a novel co-occurs with Jane Austen, the novel is most likely written by Jane Austen, so it makes sense to avoid the extra word 'by'.
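The sketch below makes this recall argument concrete with three invented evidence sentences. It counts, for each option, how many sentences contain all of that option's keywords (naive substring matching, which ERQ itself need not use), and reproduces the ordering argued above.

    # Invented evidence sentences mentioning a novel and its author.
    evidence = [
        "Pride and Prejudice, a novel written by Jane Austen",
        "Pride and Prejudice, a novel by Jane Austen",
        "Emma is Jane Austen's fourth published novel",
    ]

    options = {
        "Option 1": ["by", "Jane Austen"],
        "Option 2": ["author", "Jane Austen"],
        "Option 3": ["written", "by", "Jane Austen"],
        "Option 4": ["Jane Austen"],
    }

    for name, keywords in options.items():
        hits = sum(all(k.lower() in s.lower() for k in keywords) for s in evidence)
        print(f"{name} captures {hits} of {len(evidence)} sentences")
    # Option 4 captures all three sentences, Option 1 two, Option 3 one,
    # and Option 2 none: fewer, more common keywords mean more evidence.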
To try out an example query, please click the query link (the query will be auto-filled in the interface) and click GO.
This set contains 17 queries converted from topics in the INEX 2009 Entity Ranking track.
Single-Predicate Queries (Selection Predicate)
To help better understand the difference between ERQ and the DB-based approach, we compared ERQ with the state-of-the-art IE system, TextRunner.
For each query, we manually archived the true answers returned by ERQ and TextRunner for comparison.
It can be observed that, in many cases, TextRunner has lower recall. There are two possible reasons:
(1) some facts are not stated in its corpus at all; or
(2) the extraction phase fails to extract certain facts.
NOTE: This result does not mean ERQ is "better" than TextRunner. They are different approaches with different focuses: (1) TextRunner focuses on the extraction of relations themselves, and thus cannot answer queries about facts that were not extracted; we require no pre-extraction, but rely on the user to form appropriate query predicates. (2) We support multi-predicate queries and aim at good precision at relatively deep ranks rather than only the top few answers, which is not their focus. (3) The two systems use different corpora.
- TextRunner V1 is the most comprehensive demo, with facts extracted from 500 million high-quality Web pages.
- TextRunner V2 contains facts extracted only from Wikipedia. The exact Wikipedia snapshot used by this version is not known to us.
We used the keyword search interface provided by TextRunner and converted ERQ queries into TextRunner-friendly queries in order to get the maximum recall from it.
For example, when looking for novels by Neil Gaiman, we used by "Neil Gaiman" in the ERQ query predicate, but Neil Gaiman for TextRunner.
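The conversion itself was done by hand, but the rule of thumb can be sketched as follows (a hypothetical helper, not a tool we provide): strip the phrase quotes and drop function words such as 'by', keeping only the content keywords for TextRunner.

    # Hypothetical sketch of the manual conversion rule: remove phrase
    # quotes and drop function words so TextRunner's keyword search
    # matches as broadly as possible.
    def to_textrunner(erq_predicate):
        stop = {"by", "of", "in", "the", "a", "an"}
        words = erq_predicate.replace('"', "").split()
        return " ".join(w for w in words if w.lower() not in stop)

    print(to_textrunner('by "Neil Gaiman"'))  # -> Neil Gaiman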
Both versions of TextRunner were tried. Since Version 2 hardly returned any answers for any of the queries, we do not show its results here.
To try TextRunner V2 yourself, click Show advanced search options on its homepage and turn on the Wikipedia only option.
Multi-Predicate Queries (without Join Predicate)
These are multi-predicate queries without a join predicate.
Since the TextRunner demo does not support multiple predicates, there is no comparison with TextRunner for these queries.