Discovery Queries: Exploring an Unknown Schema

When you inherit a RushDB project — from a teammate, a data import, or a legacy integration — your immediate problem is: what is in here and how is it connected?

This tutorial walks a systematic exploration workflow that works in code, in the MCP server, and in the REST API. It teaches how to trust your findings and build reliable queries from scratch without guessing at label names or property keys.

Why schema discovery matters

RushDB does not enforce a fixed schema. Labels and properties are inferred from what you insert. That means:

Labels can vary in casing (Article, ARTICLE, article) depending on who wrote the ingest code
Properties may be absent on some records
Relationship types may be present in one direction but not the other
You cannot assume anything from type names alone — verify before querying

The ontology tools give you a live view of what exists.

Step 1: Get the ontology

The ontology describes every label and property that exists in your project, including property types and relationship paths.

Python
TypeScript
shell

from rushdb import RushDB
import os, json

db = RushDB(os.environ["RUSHDB_API_KEY"], base_url="https://api.rushdb.com/api/v1")

ontology = db.ai.get_ontology()
print(json.dumps(ontology, indent=2))

markdown = db.ai.get_ontology_markdown()
print(markdown)

import RushDB from '@rushdb/javascript-sdk'

const db = new RushDB(process.env.RUSHDB_API_KEY!)

// Returns labels, properties, and their types in structured JSON
const ontology = await db.ai.getOntology()
console.log(JSON.stringify(ontology, null, 2))

// Returns the same data formatted as compact Markdown — useful for pasting to an LLM
const markdown = await db.ai.getOntologyMarkdown()
console.log(markdown)

BASE="https://api.rushdb.com/api/v1"
TOKEN="RUSHDB_API_KEY"
H='Content-Type: application/json'

# JSON ontology
curl -s -X POST "$BASE/ai/ontology" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{}'

# Markdown ontology
curl -s -X POST "$BASE/ai/ontology/md" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{}'

Read the output carefully before writing any queries. Note every label name exactly as it appears — casing is significant.

Step 2: Sample records from each label

After identifying labels, pull a small sample from each to see real data shapes.

Python
TypeScript
shell

labels = ["CUSTOMER", "ORDER", "PRODUCT"]  # exact names from ontology

for label in labels:
    sample = db.records.find({
        "labels": [label],
        "limit": 3
    })
    print(f"\n=== {label} ({sample.total} total) ===")
    for r in sample.data:
        print(r.data)

// Sample 3 records from each label you found
const labels = ['CUSTOMER', 'ORDER', 'PRODUCT'] // exact names from ontology

for (const label of labels) {
  const sample = await db.records.find({
    labels: [label],
    limit: 3
  })
  console.log(`\n=== ${label} (${sample.total} total) ===`)
  console.log(JSON.stringify(sample.data, null, 2))
}

for LABEL in CUSTOMER ORDER PRODUCT; do
  echo "=== $LABEL ==="
  curl -s -X POST "$BASE/records/search" \
    -H "$H" -H "Authorization: Bearer $TOKEN" \
    -d "{\"labels\":[\"$LABEL\"],\"limit\":3}"
done

Step 3: Discover distinct values for key properties

Before filtering, check what values actually exist for categorical fields. Use select + groupBy to enumerate distinct values.

Python
TypeScript
shell

status_counts = db.records.find({
    "labels": ["ORDER"],
    "select": {
        "count":  {"$count": "*"},
        "status": "$record.status"
    },
    "groupBy": ["status", "count"],
    "orderBy": {"count": "desc"}
})

print("ORDER statuses:")
for row in status_counts.data:
    print(f"  {row.data.get('status')}: {row.data.get('count')}")

// Find all distinct statuses on ORDER records
const statusCounts = await db.records.find({
  labels: ['ORDER'],
  select: {
    count:  { $count: '*' },
    status: '$record.status'
  },
  groupBy: ['status', 'count'],
  orderBy: { count: 'desc' }
})

console.log('ORDER statuses:')
for (const row of statusCounts.data) {
  console.log(`  ${row.status}: ${row.count}`)
}

curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{
    "labels": ["ORDER"],
    "select": {
      "count":  {"$count": "*"},
      "status": "$record.status"
    },
    "groupBy": ["status", "count"],
    "orderBy": {"count": "desc"}
  }'

Do this for every categorical field you encounter. This prevents you from querying status: 'active' when the actual values are ACTIVE, Active, or 1.

Step 4: Verify a relationship exists before traversing it

Ontology shows you that a relationship path exists. Before building a traversal query, confirm direction and label names by searching for records that have a specific relationship.

Python
TypeScript
shell

customers = db.records.find({
    "labels": ["CUSTOMER"],
    "where": {
        "ORDER": {
            "$relation": {"type": "PLACED_ORDER", "direction": "out"}
        }
    },
    "limit": 1
})

if not customers.data:
    reverse = db.records.find({
        "labels": ["CUSTOMER"],
        "where": {
            "ORDER": {
                "$relation": {"type": "PLACED_ORDER", "direction": "in"}
            }
        },
        "limit": 1
    })
    print("Reverse direction hit:", len(reverse.data) > 0)

// Test: do CUSTOMER records have outbound PLACED_ORDER relationships to ORDER?
const customers = await db.records.find({
  labels: ['CUSTOMER'],
  where: {
    ORDER: {
      $relation: { type: 'PLACED_ORDER', direction: 'out' }
    }
  },
  limit: 1
})

if (customers.data.length === 0) {
  // Try reverse direction or different type name
  const reverse = await db.records.find({
    labels: ['CUSTOMER'],
    where: {
      ORDER: {
        $relation: { type: 'PLACED_ORDER', direction: 'in' }
      }
    },
    limit: 1
  })
  console.log('Reverse direction hit:', reverse.data.length > 0)
}

# Test outbound direction
curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{
    "labels": ["CUSTOMER"],
    "where": {
      "ORDER": {
        "$relation": {"type": "PLACED_ORDER", "direction": "out"}
      }
    },
    "limit": 1
  }'

Step 5: Identify nullable properties with `$exists`

When a property appears in some records but not others, use $exists: false to find records missing it and $exists: true to confirm presence before filtering.

Python
TypeScript
shell

missing_email = db.records.find({
    "labels": ["CUSTOMER"],
    "where": {"email": {"$exists": False}},
    "select": {"count": {"$count": "*"}},
    "groupBy": ["count"]
})

has_email = db.records.find({
    "labels": ["CUSTOMER"],
    "where": {"email": {"$exists": True}},
    "select": {"count": {"$count": "*"}},
    "groupBy": ["count"]
})

print(f"Missing email: {missing_email.data[0]['count'] if missing_email.data else 0}")
print(f"Has email: {has_email.data[0]['count'] if has_email.data else 0}")

// How many CUSTOMER records are missing an email property?
const missingEmail = await db.records.find({
  labels: ['CUSTOMER'],
  where: { email: { $exists: false } },
  select: { count: { $count: '*' } },
  groupBy: ['count']
})

// How many have email?
const hasEmail = await db.records.find({
  labels: ['CUSTOMER'],
  where: { email: { $exists: true } },
  select: { count: { $count: '*' } },
  groupBy: ['count']
})

console.log(`Missing email: ${missingEmail.data[0]?.count ?? 0}`)
console.log(`Has email: ${hasEmail.data[0]?.count ?? 0}`)

# Missing email
curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{"labels":["CUSTOMER"],"where":{"email":{"$exists":false}},"select":{"count":{"$count":"*"}},"groupBy":["count"]}'

# Has email
curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{"labels":["CUSTOMER"],"where":{"email":{"$exists":true}},"select":{"count":{"$count":"*"}},"groupBy":["count"]}'

Step 6: Build the reliable query bottom-up

After the above steps you know:

exact label names and their counts
real property names and value shapes
which categorical values are in use
which properties may be absent
relationship type names and their directions

Now build your query incrementally, adding one filter at a time and verifying results at each step.

Python
TypeScript
shell

# Incremental query building
base = db.records.find({"labels": ["ORDER"], "limit": 1})
print("Base query:", base.total)

with_status = db.records.find({
    "labels": ["ORDER"],
    "where": {"status": "shipped"},
    "limit": 1
})
print("With status filter:", with_status.total)

with_customer = db.records.find({
    "labels": ["ORDER"],
    "where": {
        "status": "shipped",
        "CUSTOMER": {
            "$relation": {"type": "PLACED_ORDER", "direction": "in"},
            "region": "EU"
        }
    },
    "limit": 1
})
print("With customer traversal:", with_customer.total)

// Step 1: verify base label works
const base = await db.records.find({ labels: ['ORDER'], limit: 1 })
console.log('Base query:', base.total)

// Step 2: add first filter
const withStatus = await db.records.find({
  labels: ['ORDER'],
  where: { status: 'shipped' },  // confirmed from Step 3
  limit: 1
})
console.log('With status filter:', withStatus.total)

// Step 3: add traversal
const withCustomer = await db.records.find({
  labels: ['ORDER'],
  where: {
    status: 'shipped',
    CUSTOMER: {
      $relation: { type: 'PLACED_ORDER', direction: 'in' },
      region: 'EU'
    }
  },
  limit: 1
})
console.log('With customer traversal:', withCustomer.total)

// Step 4: switch to production query
const result = await db.records.find({
  labels: ['ORDER'],
  where: {
    status: { $in: ['shipped', 'delivered'] },
    CUSTOMER: {
      $relation: { type: 'PLACED_ORDER', direction: 'in' },
      region: 'EU'
    }
  },
  orderBy: { createdAt: 'desc' },
  limit: 50
})

# Incremental: base
curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{"labels":["ORDER"],"limit":1}'

# Add status filter
curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{"labels":["ORDER"],"where":{"status":"shipped"},"limit":1}'

# Add traversal
curl -s -X POST "$BASE/records/search" \
  -H "$H" -H "Authorization: Bearer $TOKEN" \
  -d '{"labels":["ORDER"],"where":{"status":"shipped","CUSTOMER":{"$relation":{"type":"PLACED_ORDER","direction":"in"},"region":"EU"}},"limit":1}'

MCP server workflow equivalent

If you are using the MCP server with Claude or Cursor, follow the same pattern via tool calls:

Call getOntologyMarkdown — read and understand the output before anything else
Call getSearchQuerySpec — load the canonical query structure for the LLM's context
Use findRecords with limit: 3 for sampling, not limit: 1000
Use select + groupBy queries to enumerate categorical values
Add traversal filters one at a time, verifying counts match expectations

The getQueryBuilderPrompt tool returns a system prompt you can paste into any LLM session to enforce the ontology-first discipline described in this tutorial.

Production caveat

Ontology output reflects the current state of the database. In a live system it changes over time. If you are building an automated pipeline that depends on a specific label or property, add an existence check at startup and fail fast if the expected ontology elements are absent. Discovery workflows are interactive; production code should be explicit.

Next steps

Agent-Safe Query Planning with Ontology First — automate the discovery loop in an agent
MCP Quickstart for Real Operators — the same workflow via the MCP server
Thinking in Graphs — mental model for working with connected data

Why schema discovery matters​

Step 1: Get the ontology​

Step 2: Sample records from each label​

Step 3: Discover distinct values for key properties​

Step 4: Verify a relationship exists before traversing it​

Step 5: Identify nullable properties with $exists​

Step 6: Build the reliable query bottom-up​

MCP server workflow equivalent​

Production caveat​

Next steps​