Skip to main content

Discovery Queries: Exploring an Unknown Schema

When you inherit a RushDB project — from a teammate, a data import, or a legacy integration — your immediate problem is: what is in here and how is it connected?

This tutorial walks a systematic exploration workflow that works in code, in the MCP server, and in the REST API. It teaches how to trust your findings and build reliable queries from scratch without guessing at label names or property keys.


Why schema discovery matters

RushDB does not enforce a fixed schema. Labels and properties are inferred from what you insert. That means:

  • Labels can vary in casing (Article, ARTICLE, article) depending on who wrote the ingest code
  • Properties may be absent on some records
  • Relationship types may be present in one direction but not the other
  • You cannot assume anything from type names alone — verify before querying

The ontology tools give you a live view of what exists.


Step 1: Get the ontology

The ontology describes every label and property that exists in your project, including property types and relationship paths.

from rushdb import RushDB
import os, json

db = RushDB(os.environ["RUSHDB_API_KEY"], base_url="https://api.rushdb.com/api/v1")

ontology = db.ai.get_ontology()
print(json.dumps(ontology, indent=2))

markdown = db.ai.get_ontology_markdown()
print(markdown)

Read the output carefully before writing any queries. Note every label name exactly as it appears — casing is significant.


Step 2: Sample records from each label

After identifying labels, pull a small sample from each to see real data shapes.

labels = ["CUSTOMER", "ORDER", "PRODUCT"]  # exact names from ontology

for label in labels:
sample = db.records.find({
"labels": [label],
"limit": 3
})
print(f"\n=== {label} ({sample.total} total) ===")
for r in sample.data:
print(r.data)

Step 3: Discover distinct values for key properties

Before filtering, check what values actually exist for categorical fields. Use select + groupBy to enumerate distinct values.

status_counts = db.records.find({
"labels": ["ORDER"],
"select": {
"count": {"$count": "*"},
"status": "$record.status"
},
"groupBy": ["status", "count"],
"orderBy": {"count": "desc"}
})

print("ORDER statuses:")
for row in status_counts.data:
print(f" {row.data.get('status')}: {row.data.get('count')}")

Do this for every categorical field you encounter. This prevents you from querying status: 'active' when the actual values are ACTIVE, Active, or 1.


Step 4: Verify a relationship exists before traversing it

Ontology shows you that a relationship path exists. Before building a traversal query, confirm direction and label names by searching for records that have a specific relationship.

customers = db.records.find({
"labels": ["CUSTOMER"],
"where": {
"ORDER": {
"$relation": {"type": "PLACED_ORDER", "direction": "out"}
}
},
"limit": 1
})

if not customers.data:
reverse = db.records.find({
"labels": ["CUSTOMER"],
"where": {
"ORDER": {
"$relation": {"type": "PLACED_ORDER", "direction": "in"}
}
},
"limit": 1
})
print("Reverse direction hit:", len(reverse.data) > 0)

Step 5: Identify nullable properties with $exists

When a property appears in some records but not others, use $exists: false to find records missing it and $exists: true to confirm presence before filtering.

missing_email = db.records.find({
"labels": ["CUSTOMER"],
"where": {"email": {"$exists": False}},
"select": {"count": {"$count": "*"}},
"groupBy": ["count"]
})

has_email = db.records.find({
"labels": ["CUSTOMER"],
"where": {"email": {"$exists": True}},
"select": {"count": {"$count": "*"}},
"groupBy": ["count"]
})

print(f"Missing email: {missing_email.data[0]['count'] if missing_email.data else 0}")
print(f"Has email: {has_email.data[0]['count'] if has_email.data else 0}")

Step 6: Build the reliable query bottom-up

After the above steps you know:

  • exact label names and their counts
  • real property names and value shapes
  • which categorical values are in use
  • which properties may be absent
  • relationship type names and their directions

Now build your query incrementally, adding one filter at a time and verifying results at each step.

# Incremental query building
base = db.records.find({"labels": ["ORDER"], "limit": 1})
print("Base query:", base.total)

with_status = db.records.find({
"labels": ["ORDER"],
"where": {"status": "shipped"},
"limit": 1
})
print("With status filter:", with_status.total)

with_customer = db.records.find({
"labels": ["ORDER"],
"where": {
"status": "shipped",
"CUSTOMER": {
"$relation": {"type": "PLACED_ORDER", "direction": "in"},
"region": "EU"
}
},
"limit": 1
})
print("With customer traversal:", with_customer.total)

MCP server workflow equivalent

If you are using the MCP server with Claude or Cursor, follow the same pattern via tool calls:

  1. Call getOntologyMarkdown — read and understand the output before anything else
  2. Call getSearchQuerySpec — load the canonical query structure for the LLM's context
  3. Use findRecords with limit: 3 for sampling, not limit: 1000
  4. Use select + groupBy queries to enumerate categorical values
  5. Add traversal filters one at a time, verifying counts match expectations

The getQueryBuilderPrompt tool returns a system prompt you can paste into any LLM session to enforce the ontology-first discipline described in this tutorial.


Production caveat

Ontology output reflects the current state of the database. In a live system it changes over time. If you are building an automated pipeline that depends on a specific label or property, add an existence check at startup and fail fast if the expected ontology elements are absent. Discovery workflows are interactive; production code should be explicit.


Next steps