# Polymarket research: opportunities and arbitrage API

> Scan up to 800 active Polymarket prediction markets, cluster them semantically, score by edge, and surface the top N by expected value.

Elyra's Polymarket research pipeline fetches active markets from the Polymarket Gamma API in paginated batches, clusters them by semantic similarity to find related markets, detects mispricing relative to cluster peers, and ranks opportunities by a composite score of liquidity, volume, and probability deviation. The result is a structured report of the top trading opportunities, arbitrage candidates, and mispriced markets, ready for programmatic consumption or terminal output.

## CLI usage

Run the pipeline from the command line using `main.py` with the `polymarket` command, or invoke the module directly.

```bash
# Top 5 opportunities (default)
python3 main.py polymarket

# JSON output
python3 main.py polymarket --json

# Custom parameters
python3 main.py polymarket --top 10 --max-markets 800

# Direct module
python3 -m skills.trade_research.trade_research --top 5 --json
```

### CLI flags

<ParamField path="--top" type="integer">
  Number of top opportunities and arbitrage rows to return per table. Defaults to `5`.
</ParamField>

<ParamField path="--max-markets" type="integer">
  Maximum number of active markets to fetch from the Polymarket Gamma API before analysis begins. Defaults to `600`. The API is paged in batches of 200; fetching stops early if fewer markets are returned than the batch size.
</ParamField>

<ParamField path="--json" type="flag">
  Print raw JSON to stdout instead of the formatted Rich table. Pipe this output to `jq` or any JSON processor for downstream use.
</ParamField>
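
The paging behaviour described for `--max-markets` can be sketched as follows. This is a minimal illustration, not the pipeline's actual `fetch_all_markets`: the `fetch_page(offset, limit)` callable and the `fake_page` stub are hypothetical stand-ins for the Gamma API request.

```python
from typing import Callable, Dict, List

BATCH_SIZE = 200  # page size stated in the flag description


def fetch_paged(
    fetch_page: Callable[[int, int], List[Dict]],  # hypothetical page fetcher
    max_markets: int = 600,
) -> List[Dict]:
    """Collect up to max_markets rows, stopping early on a short page."""
    markets: List[Dict] = []
    offset = 0
    while len(markets) < max_markets:
        page = fetch_page(offset, BATCH_SIZE)
        markets.extend(page)
        if len(page) < BATCH_SIZE:  # short page: the source is exhausted
            break
        offset += BATCH_SIZE
    return markets[:max_markets]


def fake_page(offset: int, limit: int, total: int = 450) -> List[Dict]:
    """Stub that pretends the API holds `total` active markets."""
    return [{"id": str(i)} for i in range(offset, min(offset + limit, total))]


if __name__ == "__main__":
    print(len(fetch_paged(fake_page, max_markets=600)))
```

With the stub holding 450 markets, the third page returns only 50 rows, so the loop ends before the 600-market limit is reached.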

## Python usage

Call `run_research` directly from your own code. It returns the same structured dict that the CLI serialises to JSON.

```python
import asyncio
from skills.trade_research.trade_research import run_research

result = asyncio.run(run_research(max_markets=600, top_n=5))
```

### Parameters

<ParamField path="max_markets" type="integer">
  Maximum markets to fetch before analysis. Passed through to `fetch_all_markets`. Defaults to `600`.
</ParamField>

<ParamField path="top_n" type="integer">
  Number of rows to include in each output section (`top_opportunities` and `arbitrage`). Defaults to `5`.
</ParamField>

## Return value

`run_research` returns a `dict` with three top-level keys.

```json
{
  "top_opportunities": [...],
  "arbitrage": [...],
  "mispriced_markets": [...]
}
```
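
One way to consume this structure is sketched below. The `sample_result` payload is made up purely to show the shape, and the helpers `section_counts` and `scored_only` are hypothetical names, not part of the pipeline.

```python
import json

# Illustrative payload only: every value below is invented to show the shape.
sample_result = {
    "top_opportunities": [
        {"rank": 1, "market_id": "example-1", "question": "Example question A?",
         "yes_price": 0.58, "no_price": 0.46, "score": 0.12, "url": None},
        {"rank": 2, "market_id": "example-2", "question": "Example question B?",
         "yes_price": 0.30, "no_price": 0.70, "score": 0.0, "url": None},
    ],
    "arbitrage": [],
    "mispriced_markets": [],
}

SECTIONS = ("top_opportunities", "arbitrage", "mispriced_markets")


def section_counts(result: dict) -> dict:
    """Count the rows in each of the three top-level sections."""
    return {name: len(result.get(name, [])) for name in SECTIONS}


def scored_only(result: dict) -> list:
    """Keep only rows with a positive score, dropping zero-score fill-ins."""
    return [
        row for row in result.get("top_opportunities", [])
        if row.get("score", 0.0) > 0.0
    ]


if __name__ == "__main__":
    print(json.dumps(section_counts(sample_result)))
```

In real use, `sample_result` would be replaced by the dict returned from `run_research`.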

### `top_opportunities`

An array of ranked trading opportunities, sorted by a composite score of liquidity, volume, and probability deviation from cluster peers. If the detector finds fewer scored opportunities than `top_n`, the remaining slots are filled with the highest-activity markets by `log(liquidity) × log(volume)`.

<ResponseField name="rank" type="integer">
  Position in the ranked list, starting from `1`.
</ResponseField>

<ResponseField name="market_id" type="string">
  Polymarket market identifier (condition ID or numeric ID from the Gamma API).
</ResponseField>

<ResponseField name="question" type="string">
  Market question text, truncated to 80 characters with a trailing `...` if longer.
</ResponseField>

<ResponseField name="yes_price" type="float">
  Current YES outcome price as a decimal between `0` and `1`.
</ResponseField>

<ResponseField name="no_price" type="float">
  Current NO outcome price as a decimal between `0` and `1`.
</ResponseField>

<ResponseField name="liquidity" type="float">
  Total on-book liquidity in USD.
</ResponseField>

<ResponseField name="volume" type="float">
  Total traded volume in USD.
</ResponseField>

<ResponseField name="reason" type="string">
  Pipe-separated list of detection signals that triggered this opportunity, such as `prob_diff_vs_cluster=0.18 | low_liq_volume_spike` or `yes_plus_no=1.04`. High-activity fill-ins carry `high_liquidity_volume (activity)`.
</ResponseField>

<ResponseField name="score" type="float">
  Composite score used for ranking: `log(liquidity) / log(max_liquidity) × mispricing × log(volume) / log(max_volume)`. Higher is better. Fill-in rows score `0.0`.
</ResponseField>

<ResponseField name="url" type="string | null">
  Direct Polymarket event URL (`https://polymarket.com/event/{slug}`) when a slug is available; `null` otherwise.
</ResponseField>
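
The `score` formula and the activity measure used for fill-in rows can be written out as a short sketch. The function names are hypothetical, and the guard that returns `0.0` for values at or below `1` is an assumption added here to avoid a zero or negative logarithm; the pipeline may handle those cases differently.

```python
import math


def composite_score(
    liquidity: float,
    volume: float,
    mispricing: float,
    max_liquidity: float,
    max_volume: float,
) -> float:
    """log(liq)/log(max_liq) x mispricing x log(vol)/log(max_vol)."""
    # Assumption: inputs at or below 1 give a zero score instead of raising.
    if min(liquidity, volume, max_liquidity, max_volume) <= 1.0:
        return 0.0
    liq_term = math.log(liquidity) / math.log(max_liquidity)
    vol_term = math.log(volume) / math.log(max_volume)
    return liq_term * mispricing * vol_term


def activity_fill(liquidity: float, volume: float) -> float:
    """Activity measure used to pick fill-in rows: log(liq) x log(vol)."""
    if liquidity <= 0.0 or volume <= 0.0:  # assumption: no activity, no rank
        return 0.0
    return math.log(liquidity) * math.log(volume)


if __name__ == "__main__":
    print(composite_score(1_000.0, 10_000.0, 0.2, 1_000_000.0, 100_000_000.0))
```

Because each term is a ratio of two logarithms, the choice of log base does not change the score. Both ratios lie between 0 and 1 for markets above the guard, so the score is bounded by the mispricing value itself.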

### `arbitrage`

An array of arbitrage candidates and watchlist entries, ranked by total implied probability descending. The detector flags any market where `YES + NO ≥ 1.01` as a `same_market` arbitrage. When fewer than `top_n` structural opportunities exist, the list is padded with `same_market_relaxed` entries (threshold `1.005`), then `price_sum_deviation` watchlist entries, then high-liquidity leaders.

<ResponseField name="rank" type="integer">
  Position in the ranked list, starting from `1`.
</ResponseField>

<ResponseField name="type" type="string">
  Classification of the entry. One of: `same_market`, `same_market_relaxed`, `price_sum_deviation`, or `liquidity_leader`.
</ResponseField>

<ResponseField name="market_ids" type="array of strings">
  Market IDs involved. Single-market entries contain one ID; multi-leg entries list all legs.
</ResponseField>

<ResponseField name="questions" type="array of strings">
  Question text for each market in `market_ids`, truncated to 60 characters.
</ResponseField>

<ResponseField name="total_probability" type="float">
  Sum of YES and NO prices (`YES + NO`). Values above `1.0` indicate a potential arbitrage; values below indicate a correlated discount.
</ResponseField>

<ResponseField name="profit_potential_pct" type="float">
  Estimated gross profit as a percentage of capital deployed, calculated as `(total_probability − 1.0) × 100`. Does not account for trading fees or slippage.
</ResponseField>

<ResponseField name="details" type="string">
  Human-readable description of the signal, for example `YES=0.58 + NO=0.46 = 1.04` or a watchlist note to verify executable prices.
</ResponseField>

<ResponseField name="url" type="string | null">
  Polymarket event URL for single-market entries; `null` for multi-leg entries or when no slug is available.
</ResponseField>
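
The two price-sum thresholds and the `profit_potential_pct` calculation can be illustrated with a small sketch. The function names are hypothetical. The `price_sum_deviation` and `liquidity_leader` tiers are left out because their selection rules are not specified here.

```python
from typing import Optional

STRICT_THRESHOLD = 1.01    # same_market
RELAXED_THRESHOLD = 1.005  # same_market_relaxed


def classify_price_sum(yes_price: float, no_price: float) -> Optional[str]:
    """Return the entry type for a single market, or None if neither applies."""
    total = yes_price + no_price
    if total >= STRICT_THRESHOLD:
        return "same_market"
    if total >= RELAXED_THRESHOLD:
        return "same_market_relaxed"
    return None


def profit_potential_pct(total_probability: float) -> float:
    """Gross profit estimate in percent, before fees and slippage."""
    return (total_probability - 1.0) * 100


if __name__ == "__main__":
    yes, no = 0.58, 0.46
    print(classify_price_sum(yes, no), round(profit_potential_pct(yes + no), 2))
```

For the `YES=0.58 + NO=0.46 = 1.04` example, the entry is classified as `same_market` with a gross profit potential of about 4%.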

### `mispriced_markets`

Markets whose YES price deviates from the mean YES price of their semantic cluster by at least `0.15`. These are candidates for mean-reversion trades within a thematic group.

<ResponseField name="market_id" type="string">
  Polymarket market identifier.
</ResponseField>

<ResponseField name="question" type="string">
  Market question text.
</ResponseField>

<ResponseField name="yes_price" type="float">
  Current YES price on this market.
</ResponseField>

<ResponseField name="cluster_mean_yes" type="float">
  Mean YES price across all other markets in the same semantic cluster.
</ResponseField>

<ResponseField name="mispricing" type="float">
  Absolute deviation `|yes_price − cluster_mean_yes|`, rounded to 3 decimal places. The detection threshold is `0.15`.
</ResponseField>
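
A minimal sketch of the deviation check for one cluster follows. It assumes the mean is taken over the other members of the cluster, as the `cluster_mean_yes` description states, and that a cluster arrives as a list of dicts with `market_id`, `question`, and `yes_price`. The function name `find_mispriced` is hypothetical.

```python
from typing import Dict, List

MISPRICING_THRESHOLD = 0.15  # detection threshold from the docs


def find_mispriced(cluster: List[Dict], threshold: float = MISPRICING_THRESHOLD) -> List[Dict]:
    """Flag members whose YES price deviates from the mean of their peers."""
    n = len(cluster)
    if n < 2:  # a single market has no peers to compare against
        return []
    total = sum(m["yes_price"] for m in cluster)
    flagged = []
    for m in cluster:
        # Assumption: leave-one-out mean over the other cluster members.
        peers_mean = (total - m["yes_price"]) / (n - 1)
        deviation = abs(m["yes_price"] - peers_mean)
        if deviation >= threshold:
            flagged.append({
                "market_id": m["market_id"],
                "question": m["question"],
                "yes_price": m["yes_price"],
                "cluster_mean_yes": round(peers_mean, 3),
                "mispricing": round(deviation, 3),
            })
    return flagged


if __name__ == "__main__":
    demo = [  # made-up prices for illustration
        {"market_id": "a", "question": "A?", "yes_price": 0.30},
        {"market_id": "b", "question": "B?", "yes_price": 0.32},
        {"market_id": "c", "question": "C?", "yes_price": 0.60},
    ]
    print([row["market_id"] for row in find_mispriced(demo)])
```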

## Semantic clustering

Before scoring, the pipeline groups markets by topic using cosine similarity on question text. It prefers `sentence-transformers/all-MiniLM-L6-v2` and falls back to TF-IDF (via scikit-learn) when the library is unavailable. Markets are merged into a cluster when their pairwise similarity meets the threshold (`0.75` for sentence-transformers, `0.60` for the TF-IDF fallback). Clusters with fewer than two members are discarded.

Clustering is used to compute `cluster_mean_yes` for mispricing detection and to populate the `mispriced_markets` list. It does not affect the arbitrage detector, which operates on individual market price sums.
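
The threshold-based grouping can be sketched with the standard library alone. Plain bag-of-words cosine similarity stands in for the sentence-transformers or TF-IDF vectors the pipeline uses, so real similarity values will differ. Merging any pair above the threshold through union-find (single-link) is also an assumption about how clusters are formed, and `cluster_questions` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_questions(questions: List[str], threshold: float = 0.60) -> List[List[int]]:
    """Group question indices whose pairwise similarity meets the threshold."""
    # Stand-in vectors: lowercase whitespace tokens, not real embeddings.
    vectors = [Counter(q.lower().split()) for q in questions]
    parent = list(range(len(questions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(questions)):
        groups.setdefault(find(i), []).append(i)
    # Clusters with fewer than two members are discarded.
    return [members for members in groups.values() if len(members) >= 2]


if __name__ == "__main__":
    demo = [
        "Will Bitcoin close above 100k in 2025?",
        "Will Bitcoin close above 120k in 2025?",
        "Will it rain in Paris tomorrow?",
    ]
    print(cluster_questions(demo))
```

The pairwise loop is quadratic in the number of markets, which stays manageable at a few hundred markets but would need an approximate nearest-neighbour step at larger scales.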

<Note>
  The first run downloads `sentence-transformers/all-MiniLM-L6-v2` from Hugging Face Hub and caches it to `.cache/huggingface/` inside your project root. This download is roughly 90 MB and only happens once. To skip it entirely, omit `sentence-transformers` from your environment; the pipeline will use TF-IDF clustering automatically.
</Note>
