API Reference¶
Assets¶
Alignment asset types for Stargazer.¶
spec: docs/architecture/types.md
Alignment
dataclass
¶
Bases: Asset
BAM/CRAM alignment file asset.
Carries reference_cid and r1_cid for provenance (PROV entity derivation).
Source code in src/stargazer/assets/alignment.py
AlignmentIndex
dataclass
¶
Bases: Asset
BAI/CRAI alignment index file asset.
Carries alignment_cid linking to the Alignment it indexes.
Source code in src/stargazer/assets/alignment.py
BQSRReport
dataclass
¶
Bases: Asset
BQSR recalibration table produced by GATK BaseRecalibrator.
Carries alignment_cid linking to the Alignment it was produced from.
Source code in src/stargazer/assets/alignment.py
Asset base dataclass for Stargazer.¶
spec: docs/architecture/types.md
Asset
dataclass
¶
Base class for all typed file assets in Stargazer.
Attributes:
| Name | Type | Description |
|---|---|---|
| cid | str | Content identifier (CID) for the stored file |
| path | Path \| None | Local filesystem path (set after download or upload) |
| keyvalues | dict[str, str] | Free-form metadata for bare Asset instances: the catchall for records whose asset key has no registered class in this process. Serialized verbatim (the |
Subclasses declare typed fields as normal dataclass attributes:
@dataclass
class Alignment(Asset):
    _asset_key: ClassVar[str] = "alignment"
    sample_id: str = ""
    duplicates_marked: bool = False
Fields are plain Python attributes. to_keyvalues() serializes them to
dict[str, str] at storage boundaries; from_keyvalues() reconstructs
from storage. str fields pass through directly; all other types use
json.dumps / json.loads.
Source code in src/stargazer/assets/asset.py
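The round-trip described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Stargazer's implementation: `DemoAlignment` is a hypothetical stand-in class, and only the rule "str passes through, everything else goes through json.dumps / json.loads" is taken from the text.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class DemoAlignment:
    """Hypothetical stand-in for a typed Asset subclass (not Stargazer's class)."""

    sample_id: str = ""
    duplicates_marked: bool = False

    def to_keyvalues(self) -> dict[str, str]:
        out: dict[str, str] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            # str passes through unchanged; every other type is JSON-encoded.
            out[f.name] = value if isinstance(value, str) else json.dumps(value)
        return out

    @classmethod
    def from_keyvalues(cls, kv: dict[str, str]) -> "DemoAlignment":
        kwargs = {}
        for f in fields(cls):
            if f.name not in kv:
                continue
            raw = kv[f.name]
            # f.type is the class `str`, or the string "str" when annotations
            # are postponed; handle both.
            is_str = f.type is str or f.type == "str"
            kwargs[f.name] = raw if is_str else json.loads(raw)
        return cls(**kwargs)


kv = DemoAlignment(sample_id="NA12878", duplicates_marked=True).to_keyvalues()
restored = DemoAlignment.from_keyvalues(kv)
```

Here `kv` holds only strings, so the boolean is stored as the JSON text `true` and decoded back to `True` on the way in.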
__init_subclass__(**kwargs)
¶
__setattr__(name, value)
¶
Enforce declared fields on typed subclasses; pass through on base Asset.
Source code in src/stargazer/assets/asset.py
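One way such field enforcement could work is sketched below. The class names and the choice of `AttributeError` are assumptions made for illustration; the reference only states that typed subclasses enforce declared fields while the base class passes assignments through.

```python
from dataclasses import dataclass, fields


@dataclass
class StrictBase:
    """Hypothetical base: subclasses reject attributes they did not declare."""

    cid: str = ""

    def __setattr__(self, name: str, value: object) -> None:
        # The bare base class accepts anything; typed subclasses only accept
        # names declared as dataclass fields.
        if type(self) is not StrictBase:
            declared = {f.name for f in fields(self)}
            if name not in declared:
                raise AttributeError(
                    f"{type(self).__name__} has no declared field {name!r}"
                )
        super().__setattr__(name, value)


@dataclass
class DemoReads(StrictBase):
    sample_id: str = ""


reads = DemoReads(cid="cid-123", sample_id="NA12878")
reads.sample_id = "NA12891"  # declared field: allowed

try:
    reads.sampel_id = "typo"  # misspelled field: rejected
    rejected = False
except AttributeError:
    rejected = True

bare = StrictBase()
bare.extra = 1  # base class: passes through
```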
fetch()
async
¶
Download this asset and all its companions from storage.
Downloads the asset itself, then queries storage for any assets linked
via {_asset_key}_cid to auto-download companions (e.g. indices,
mate reads).
Source code in src/stargazer/assets/asset.py
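The companion lookup can be pictured as a filter over storage records. The sketch below uses an in-memory list instead of real storage, and the `reference_index` asset key and record layout are assumptions; only the `{_asset_key}_cid` linking convention comes from the description above.

```python
def find_companions(records: list[dict], asset_key: str, cid: str) -> list[dict]:
    """Return records whose '<asset_key>_cid' keyvalue points at `cid`."""
    link_key = f"{asset_key}_cid"
    return [r for r in records if r["keyvalues"].get(link_key) == cid]


# Hypothetical storage contents.
records = [
    {"cid": "cid-ref", "keyvalues": {"asset": "reference", "build": "GRCh38"}},
    {"cid": "cid-fai", "keyvalues": {"asset": "reference_index", "reference_cid": "cid-ref"}},
    {"cid": "cid-other", "keyvalues": {"asset": "reference_index", "reference_cid": "cid-zzz"}},
]

companions = find_companions(records, asset_key="reference", cid="cid-ref")
```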
from_dict(data)
classmethod
¶
Reconstruct from a serialized dict.
Source code in src/stargazer/assets/asset.py
from_keyvalues(kv, cid='', path=None)
classmethod
¶
Reconstruct from a storage keyvalues dict.
str fields are assigned directly; all other types are deserialized
with json.loads (raising json.JSONDecodeError on values that
don't parse — callers that must tolerate malformed records catch it,
see specialize()). Bare Asset receives the keyvalues verbatim.
Source code in src/stargazer/assets/asset.py
to_dict()
¶
to_keyvalues()
¶
Serialize to storage format.
str fields pass through as-is; all other types are serialized with
json.dumps. Bare Asset instances return a copy of their keyvalues
dict verbatim — a copy so callers (e.g. _owner stamping at the
storage layer) can mutate the result without touching the Asset.
Source code in src/stargazer/assets/asset.py
update(path, **kwargs)
async
¶
Upload file and set cid. Shared by all asset types.
Source code in src/stargazer/assets/asset.py
assemble(**filters)
async
¶
Query storage by keyvalue filters and return specialized assets.
The asset filter key accepts a string or list of strings to narrow
by asset type. Other filters are passed through as keyvalue matchers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **filters | Any | Keyvalue filters. Values may be scalars or lists (cartesian product). | {} |
Returns:
| Type | Description |
|---|---|
| list[Asset] | Flat list of specialized Asset subclass instances. |
Examples:
assets = await assemble(build="GRCh38", asset="reference")
ref = next(a for a in assets if isinstance(a, Reference))
assets = await assemble(sample_id="NA12878", asset=["r1", "r2"])
r1 = next(a for a in assets if isinstance(a, R1))
Source code in src/stargazer/assets/asset.py
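The "cartesian product" behaviour of list-valued filters can be illustrated as follows. `expand_filters` is a hypothetical helper; whether assemble() issues one storage query per combination is an assumption, and the sketch only shows how scalar and list values would combine.

```python
from itertools import product


def expand_filters(**filters) -> list[dict]:
    """Expand scalar-or-list filter values into one dict per combination."""
    keys = list(filters)
    # Treat every scalar as a one-element list so product() handles both.
    value_lists = [v if isinstance(v, list) else [v] for v in filters.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


queries = expand_filters(sample_id="NA12878", asset=["r1", "r2"])
```

With one scalar and one two-element list, the expansion yields two filter dicts, one for each asset type.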
Read file asset types for Stargazer.¶
spec: docs/architecture/types.md
R1
dataclass
¶
Bases: Asset
R1 (forward) FASTQ read file asset.
Carries mate_cid pointing to the paired R2 asset's CID (None for single-end).
Source code in src/stargazer/assets/reads.py
Reference genome asset types for Stargazer.¶
spec: docs/architecture/types.md
AlignerIndex
dataclass
¶
Bases: Asset
Aligner index file asset (one asset per file for multi-file indices).
Source code in src/stargazer/assets/reference.py
Reference
dataclass
¶
Bases: Asset
Reference FASTA file asset.
Source code in src/stargazer/assets/reference.py
contigs
property
¶
Read contig names from the companion .fai index.
Requires fetch() to have been called first so the ReferenceIndex companion is downloaded alongside this reference.
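Reading contig names from a .fai file amounts to taking the first tab-separated column of each line. The sketch below is a standalone illustration with a hypothetical `read_contigs` helper and made-up index values; it is not the body of the `contigs` property.

```python
import tempfile
from pathlib import Path


def read_contigs(fai_path: Path) -> list[str]:
    """Return contig names (first column) from a samtools .fai index."""
    contigs = []
    with fai_path.open() as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                contigs.append(line.split("\t")[0])
    return contigs


with tempfile.TemporaryDirectory() as tmp:
    fai = Path(tmp) / "demo.fa.fai"
    # Columns: name, length, offset, line bases, line width (illustrative values).
    fai.write_text("chr1\t248956422\t6\t60\t61\nchrM\t16569\t253105708\t60\t61\n")
    names = read_contigs(fai)
```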
ReferenceIndex
dataclass
¶
Bases: Asset
FASTA index (.fai) file asset.
Carries reference_cid linking back to the Reference it was built from.
Source code in src/stargazer/assets/reference.py
scRNA-seq asset types for Stargazer.¶
spec: docs/workflows/scrna.md
AnnData
dataclass
¶
Bases: Asset
AnnData (.h5ad) file asset for single-cell RNA-seq data.
Tracks pipeline stage, cell/gene counts, and provenance through the scRNA-seq processing steps.
Source code in src/stargazer/assets/scrna.py
Variant call asset types for Stargazer.¶
spec: docs/architecture/types.md
KnownSites
dataclass
¶
Bases: Asset
Known variant sites VCF used for BQSR.
Standalone asset — carries build and source fields, no container needed.
Source code in src/stargazer/assets/variants.py
KnownSitesIndex
dataclass
¶
Bases: Asset
VCF index (.idx) file for a KnownSites asset.
Carries known_sites_cid linking to the KnownSites VCF it indexes. Fetched automatically alongside the VCF via Asset.fetch().
Source code in src/stargazer/assets/variants.py
VQSRModel
dataclass
¶
Bases: Asset
VQSR recalibration model (.recal file + tranches path).
Produced by VariantRecalibrator. The recal file is the primary path; the companion tranches file path is stored in keyvalues["tranches_path"].
Source code in src/stargazer/assets/variants.py
Variants
dataclass
¶
Config¶
Centralized configuration for Stargazer.¶
Sets environment variable defaults at import time. Consumers read os.environ directly rather than importing named values from this module.
Also the source of truth for the lean per-task Flyte environments
(scrna_env, gatk_env). The user-facing AppEnvironment lives in
infra/app.py alongside the FastAPI application it deploys.
Rules:
- PINATA_JWT: No default; absence means no authenticated Pinata.
- PINATA_GATEWAY: Defaults to dweb.link if unset. Set to empty string to force a failure on public downloads.
- PINATA_VISIBILITY: Defaults to "private" if unset. Only evaluated by PinataClient; if JWT is unset, downloads are always public.
- STARGAZER_LOCAL: Local storage directory. Defaults to ~/.stargazer/local.
spec: docs/architecture/configuration.md
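The defaulting rules above map naturally onto `setdefault`. The following is a sketch under assumptions: `apply_defaults` is a hypothetical function, and the exact string stored for the gateway (bare host versus full URL) is not specified in this reference.

```python
from collections.abc import MutableMapping
from pathlib import Path


def apply_defaults(env: MutableMapping[str, str]) -> None:
    """Fill in defaults without overriding values that are already set."""
    # setdefault keeps an explicit empty string, which is how a caller can
    # force public downloads to fail.
    env.setdefault("PINATA_GATEWAY", "dweb.link")
    env.setdefault("PINATA_VISIBILITY", "private")
    env.setdefault("STARGAZER_LOCAL", str(Path.home() / ".stargazer" / "local"))
    # PINATA_JWT deliberately gets no default.


demo_env = {"PINATA_GATEWAY": ""}
apply_defaults(demo_env)
```

Passing `os.environ` instead of a plain dict would apply the same defaults to the process environment.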
log_execution()
¶
Start a per-execution log sink and return the execution ID.
Derives the workflow name from the calling function, fetches the current git commit hash, and creates a dedicated logfile for this execution. Warns if the git tree has uncommitted changes.
Source code in src/stargazer/config.py
Marshal¶
Output marshaling: typed object → dict (for MCP response serialization).¶
spec: docs/architecture/mcp-server.md
marshal_output(value)
¶
Convert a typed Python object to a JSON-friendly structure for MCP transport.
Source code in src/stargazer/marshal.py
Notebooks¶
Stargazer asset and types system tutorial.¶
A guided walkthrough of why typed Asset subclasses exist, the three
layers each Asset has (cid, path, fields), defining a new asset class,
and the round-trip from Python object to storage and back.
SampleSheet is defined here as a marimo reusable top-level symbol
(with app.setup: + @app.class_definition) so the Tasks tutorial
imports this exact class rather than redefining it — one definition, no
drift across the tutorial sequence.
spec: docs/architecture/types.md
SampleSheet
dataclass
¶
Bases: Asset
CSV of per-sample metadata for a cohort.
Source code in src/stargazer/notebooks/tutorials/assets.py
_()
¶
Section 8 — recap.
Source code in src/stargazer/notebooks/tutorials/assets.py
Stargazer execution tutorial.¶
Assumes the asset/task/workflow primitives from the earlier tutorials and
spends them on the headline lesson: running a workflow first locally,
then on a remote cluster with no code changes. It imports the very same
audit_cohorts workflow composed in workflows.py and runs it both ways,
charting the result.
spec: docs/architecture/workflows.md
_(mo, plt, summaries_remote)
¶
Section 5 — chart the per-cohort counts.
Source code in src/stargazer/notebooks/tutorials/execution.py
Stargazer tasks tutorial.¶
Imports the SampleSheet asset from the assets tutorial and walks
through defining a single Flyte task that consumes it, then explains
where task environments live. Running the task, locally and remotely,
is saved for the execution tutorial; composing it into a workflow is
covered by the workflows tutorial that precedes it.
CohortSummary, tutorial_env, and summarize_cohort are defined here
as marimo reusable top-level symbols, so the later tutorials import
these exact objects rather than redefining them — no drift.
spec: docs/architecture/tasks.md
CohortSummary
dataclass
¶
Bases: Asset
JSON summary of a cohort: per-organism counts and totals.
Source code in src/stargazer/notebooks/tutorials/tasks.py
_()
¶
Recap.
Source code in src/stargazer/notebooks/tutorials/tasks.py
summarize_cohort(sheet)
async
¶
Count samples and unique organisms in a cohort sample sheet.
Source code in src/stargazer/notebooks/tutorials/tasks.py
Stargazer workflows tutorial.¶
Imports the summarize_cohort task from the tasks tutorial and composes
it into a fan-out workflow. In Flyte v2 a workflow is just a task that
calls other tasks, so fan-out is plain asyncio.gather over task calls.
audit_cohorts is defined here as a marimo reusable top-level symbol
(with app.setup: + @app.function), so the Execution tutorial runs
this exact workflow object — no copy, no drift.
spec: docs/architecture/workflows.md
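The fan-out pattern described above can be shown with plain coroutines. This sketch uses no Flyte decorators; `summarize` and `audit` are hypothetical stand-ins for summarize_cohort and audit_cohorts, and each "sheet" is just a list of organism names.

```python
import asyncio


async def summarize(sheet: list[str]) -> dict[str, int]:
    """Stand-in for a task: count samples and unique organisms."""
    await asyncio.sleep(0)  # yield control, as a real task would while waiting
    return {"samples": len(sheet), "organisms": len(set(sheet))}


async def audit(sheets: list[list[str]]) -> list[dict[str, int]]:
    """Stand-in for a workflow: fan out one call per sheet and gather results."""
    return await asyncio.gather(*(summarize(s) for s in sheets))


results = asyncio.run(audit([["human", "human", "mouse"], ["yeast"]]))
```

asyncio.gather returns results in the order the awaitables were passed, so each summary lines up with its input sheet.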
_()
¶
Recap and next.
Source code in src/stargazer/notebooks/tutorials/workflows.py
audit_cohorts(sheets)
async
¶
Summarize many cohorts in parallel and return all results.
Source code in src/stargazer/notebooks/tutorials/workflows.py
make_demo_sheets()
async
¶
Build and upload three small cohort SampleSheet assets.
Reusable so the Execution tutorial imports the exact same inputs instead of rebuilding them.
Source code in src/stargazer/notebooks/tutorials/workflows.py
Stargazer scRNA-seq.¶
Multi-sample fan-out preprocessing plus interactive clustering, side-by-side UMAPs, and marker-gene tables. The workflow-tier showcase: the same primitives the tutorials teach, applied as a full-fat scRNA-seq pipeline.
spec: docs/architecture/notebook.md
_(mo)
¶
Links to the tutorial sequence.
Source code in src/stargazer/notebooks/workflows/scrna_pipeline.py
Blank workspace notebook.¶
An empty marimo notebook for authoring from scratch. /workspace/create
copies this seed under your chosen name and injects a [tool.stargazer]
resource block into the header above — edit those values to rightsize the
pod for your workload.
Workspace template — barebones notebook skeleton.¶
A choose-your-own-adventure scaffold for authoring your own notebook from
scratch: ingest a file, define an asset for it, process it with a task,
and fan that task out in a workflow. Each section is a TODO-style
template — pair with assets.py and tasks.py for the
why and the deeper patterns.
spec: docs/architecture/notebook.md
_(mo, my_asset, my_workflow)
async
¶
Run the workflow on your uploaded asset.
Source code in src/stargazer/notebooks/workspace/template.py
Registry¶
Task registry for auto-discovery of Flyte tasks and workflows.
Discovers all tasks from stargazer.tasks and stargazer.workflows modules, extracts parameter types, defaults, and return types for MCP catalog exposure.
spec: docs/architecture/mcp-server.md
TaskInfo
dataclass
¶
Complete metadata about a registered task.
Source code in src/stargazer/registry.py
TaskOutput
dataclass
¶
TaskParam
dataclass
¶
TaskRegistry
dataclass
¶
Discovers and provides access to all Flyte tasks and workflows.
Source code in src/stargazer/registry.py
__post_init__()
¶
get(name)
¶
list_tasks(category=None)
¶
List all registered tasks, optionally filtered by category.
Source code in src/stargazer/registry.py
to_catalog(category=None)
¶
Return a JSON-serializable catalog of all tasks.
Source code in src/stargazer/registry.py
Server¶
Stargazer MCP Server.¶
Exposes storage tools and a dynamic task runner via FastMCP. Tasks and workflows are auto-discovered from the registry and executed through the Flyte local run context.
Usage
stargazer          # stdio transport (default)
stargazer --http   # streamable-http transport
spec: docs/architecture/mcp-server.md
delete_file(cid)
async
¶
download_file(cid)
async
¶
Download a file by CID to local cache. Returns the local path.
fetch_resource_bundle(bundle_name)
async
¶
Download a predefined resource bundle into local storage.
Bundles are curated sets of files (e.g. reference genomes, demo datasets) defined in the codebase. Each file is identified by CID and downloaded via the standard storage path (signed URL with JWT, or public IPFS gateway).
When PINATA_JWT is set, remote metadata is authoritative and overwrites local records. Without a JWT, the bundle manifest provides the metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bundle_name | str | Name of the bundle (from list_bundles). | required |
Returns:
| Type | Description |
|---|---|
| list[dict] | List of fetched files with cid, keyvalues, and local path. |
Source code in src/stargazer/server.py
list_bundles()
¶
List available resource bundles.
Returns:
| Type | Description |
|---|---|
| list[dict] | List of bundles with name, description, and file_count. |
Source code in src/stargazer/server.py
list_tasks(category=None)
¶
List available tasks and workflows with their parameter signatures.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str \| None | Filter by "task" or "workflow". Omit for all. | None |
Returns:
| Type | Description |
|---|---|
| list[dict] | Catalog of tasks with name, category, description, params, and outputs. |
Source code in src/stargazer/server.py
main()
¶
query_files(keyvalues)
async
¶
Query files by metadata key-value pairs. Returns matching files.
run_task(task_name, filters, inputs=None)
async
¶
Run a single task by name for ad-hoc experimentation.
Use this for testing individual tools in isolation. Asset parameters are assembled from storage using the provided filters — one call to assemble() resolves all required assets. Scalar and Path parameters are passed separately via inputs.
For reproducible pipeline runs, use run_workflow instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_name | str | Name of the task (from list_tasks with category="task"). | required |
| filters | dict | Keyvalue filters for assemble() to resolve asset parameters (e.g. {"build": "GRCh38", "sample_id": "NA12878"}). | required |
| inputs | dict \| None | Optional scalar/Path keyword arguments (str, int, bool, list[str]). | None |
Returns:
| Type | Description |
|---|---|
| dict | Serialized task output. Single outputs returned directly, multi-outputs as {"o0": ..., "o1": ...}. |
Source code in src/stargazer/server.py
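The output shape described in the Returns table can be sketched as below. `pack_outputs` is a hypothetical helper, and treating a tuple as the marker for "multi-output" is an assumption; the reference only specifies the resulting {"o0": ..., "o1": ...} keys.

```python
from typing import Any


def pack_outputs(result: Any) -> Any:
    """Return single outputs as-is; key multiple outputs as o0, o1, ..."""
    if isinstance(result, tuple):
        return {f"o{i}": value for i, value in enumerate(result)}
    return result


single = pack_outputs({"cid": "cid-1"})
multi = pack_outputs(({"cid": "cid-1"}, {"cid": "cid-2"}))
```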
run_workflow(workflow_name, inputs)
async
¶
Run a workflow by name for reproducible pipeline execution.
Workflows accept scalar parameters (str, int, bool, list[str]) and handle their own asset assembly internally. Pass inputs exactly as the workflow signature defines them — no automatic resolution is performed.
For ad-hoc experimentation with individual tools, use run_task instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| workflow_name | str | Name of the workflow (from list_tasks with category="workflow"). | required |
| inputs | dict | Keyword arguments as a JSON dict (scalars only). | required |
Returns:
| Type | Description |
|---|---|
| dict | Serialized workflow output. Single outputs returned directly, multi-outputs as {"o0": ..., "o1": ...}. |
Source code in src/stargazer/server.py
show_config()
async
¶
Show current Stargazer configuration and available task counts.
Source code in src/stargazer/server.py
update_file(cid, keyvalues)
async
¶
Update (merge) metadata on an existing file by CID — fix a typo'd or mis-tagged record without re-uploading the bytes (the CID is unchanged).
keyvalues is a patch: the supplied keys are added or overwritten, keys you omit are preserved (Pinata merge — there is no key removal). It must include "asset"; the patch validates through the same rules as upload (registered keys check their declared fields; reserved underscore keys like _owner are stamped automatically and must not be supplied).
When displaying results, always show a table with the CID and all keyvalues.
Source code in src/stargazer/server.py
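The patch semantics described above (merge, no key removal, "asset" required, reserved underscore keys rejected) can be sketched as a pure function. `apply_patch` and its error messages are hypothetical, and the per-asset field validation mentioned in the text is left out.

```python
def apply_patch(existing: dict[str, str], patch: dict[str, str]) -> dict[str, str]:
    """Merge `patch` into `existing`: supplied keys win, omitted keys survive."""
    if "asset" not in patch:
        raise ValueError('patch must include "asset"')
    reserved = sorted(k for k in patch if k.startswith("_"))
    if reserved:
        raise ValueError(f"reserved keys must not be supplied: {reserved}")
    merged = dict(existing)  # copy so the stored record is not mutated
    merged.update(patch)
    return merged


stored = {"asset": "reference", "build": "GRCh37", "_owner": "alice"}
merged = apply_patch(stored, {"asset": "reference", "build": "GRCh38"})
```

The stamped `_owner` value survives the merge because the patch never touches it.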
upload_file(path, keyvalues)
async
¶
Upload a file with metadata key-value pairs.
keyvalues must include "asset". Registered asset keys (e.g. asset=reference) validate strictly against their declared fields; unregistered keys are stored as generic assets with the keyvalues verbatim. Reserved system keys (underscore-prefixed, e.g. _owner) are stamped automatically and must not be supplied.
When displaying results, always show a table with the CID and all keyvalues.
Source code in src/stargazer/server.py
Tasks¶
apply_bqsr task for Stargazer.¶
Applies BQSR recalibration to BAM files using GATK ApplyBQSR.
spec: docs/architecture/tasks.md
apply_bqsr(alignment, ref, bqsr_report)
async
¶
Apply Base Quality Score Recalibration to a BAM file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset | required |
| ref | Reference | Reference FASTA asset | required |
| bqsr_report | BQSRReport | Recalibration table from base_recalibrator | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with recalibrated BAM file |
Source code in src/stargazer/tasks/gatk/apply_bqsr.py
ApplyVQSR task for Stargazer.¶
Applies VQSR recalibration to a VCF using GATK ApplyVQSR.
spec: docs/architecture/tasks.md
apply_vqsr(vcf, ref, vqsr_model, truth_sensitivity_filter_level=None)
async
¶
Apply VQSR recalibration to a VCF using GATK ApplyVQSR.
The recalibration mode (SNP or INDEL) is read from vqsr_model.keyvalues["mode"]. If truth_sensitivity_filter_level is not provided, defaults to 99.5 for SNP and 99.0 for INDEL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vcf | Variants | Raw (or SNP-filtered) VCF Variants asset | required |
| ref | Reference | Reference FASTA asset | required |
| vqsr_model | VQSRModel | Recalibration model from variant_recalibrator | required |
| truth_sensitivity_filter_level | float \| None | VQSLOD filter threshold (optional) | None |
Returns:
| Type | Description |
|---|---|
| Variants | Variants asset with VQSR-filtered VCF |
Source code in src/stargazer/tasks/gatk/apply_vqsr.py
base_recalibrator task for Stargazer.¶
Creates BQSR recalibration table using GATK BaseRecalibrator.
spec: docs/architecture/tasks.md
base_recalibrator(alignment, ref, known_sites)
async
¶
Generate a Base Quality Score Recalibration report.
Uses GATK BaseRecalibrator to analyze patterns of covariation in the sequence dataset and produce a recalibration table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset (should be sorted and have duplicates marked) | required |
| ref | Reference | Reference FASTA asset | required |
| known_sites | list[KnownSites] | List of KnownSites VCF assets (dbSNP, known indels, etc.) | required |
Returns:
| Type | Description |
|---|---|
| BQSRReport | BQSRReport asset containing the recalibration table |
Source code in src/stargazer/tasks/gatk/base_recalibrator.py
CombineGVCFs task for Stargazer.¶
Combines multiple per-sample GVCFs into a single multi-sample GVCF for joint genotyping using GATK CombineGVCFs.
spec: docs/architecture/tasks.md
combine_gvcfs(gvcfs, ref, cohort_id='cohort')
async
¶
Combine multiple per-sample GVCFs into a single multi-sample GVCF.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gvcfs | list[Variants] | List of Variants assets, each containing a GVCF from a single sample | required |
| ref | Reference | Reference FASTA asset | required |
| cohort_id | str | Identifier for the combined cohort (default: "cohort") | 'cohort' |
Returns:
| Type | Description |
|---|---|
| Variants | Variants asset with combined multi-sample GVCF |
Source code in src/stargazer/tasks/gatk/combine_gvcfs.py
GATK CreateSequenceDictionary task for reference genome.¶
spec: docs/architecture/tasks.md
create_sequence_dictionary(ref)
async
¶
Create a sequence dictionary (.dict file) using GATK CreateSequenceDictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| SequenceDict | SequenceDict asset containing the .dict file |
Source code in src/stargazer/tasks/gatk/create_sequence_dictionary.py
haplotype_caller task for Stargazer.¶
Calls germline SNPs and indels via local re-assembly of haplotypes using GATK HaplotypeCaller in GVCF mode.
spec: docs/architecture/tasks.md
haplotype_caller(alignment, ref)
async
¶
Call germline variants in GVCF mode using GATK HaplotypeCaller.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Sorted, duplicate-marked BAM asset (BQSR-recalibrated recommended) | required |
| ref | Reference | Reference FASTA asset with sequence dictionary | required |
Returns:
| Type | Description |
|---|---|
| Variants | Variants asset containing the per-sample GVCF |
Source code in src/stargazer/tasks/gatk/haplotype_caller.py
GATK IndexFeatureFile task for indexing VCF files.¶
spec: docs/architecture/tasks.md
index_feature_file(known_sites)
async
¶
Index a VCF file using GATK IndexFeatureFile.
Required by tools like BaseRecalibrator that need random-access queries over known sites VCFs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| known_sites | KnownSites | KnownSites VCF asset to index | required |
Returns:
| Type | Description |
|---|---|
| KnownSitesIndex | KnownSitesIndex asset pointing to the generated .idx file |
Source code in src/stargazer/tasks/gatk/index_feature_file.py
joint_call_gvcfs task for Stargazer.¶
Consolidates per-sample GVCFs into a GenomicsDB datastore and performs joint genotyping in a single task, avoiding the need to persist the GenomicsDB workspace between tasks.
spec: docs/architecture/tasks.md
joint_call_gvcfs(gvcfs, ref, intervals, cohort_id='cohort')
async
¶
Consolidate GVCFs into GenomicsDB and joint-genotype in a single task.
Runs GenomicsDBImport followed immediately by GenotypeGVCFs within the same execution context, so the workspace never needs to leave the pod.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gvcfs | list[Variants] | Per-sample GVCF Variants assets from HaplotypeCaller | required |
| ref | Reference | Reference FASTA asset | required |
| intervals | list[str] | Genomic intervals to process (required by GenomicsDBImport) | required |
| cohort_id | str | Sample ID label for the output VCF (default: "cohort") | 'cohort' |
Returns:
| Type | Description |
|---|---|
| Variants | Joint-genotyped Variants asset (VCF) |
Source code in src/stargazer/tasks/gatk/joint_call_gvcfs.py
mark_duplicates task for Stargazer.¶
Marks duplicate reads in BAM files using GATK MarkDuplicates.
spec: docs/architecture/tasks.md
mark_duplicates(alignment)
async
¶
Mark duplicate reads in a BAM file.
Uses GATK MarkDuplicates to identify and tag duplicate reads that originated from the same DNA fragment (PCR or optical duplicates). Duplicates are marked with the 0x0400 SAM flag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset (should be coordinate sorted) | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with duplicates marked |
Source code in src/stargazer/tasks/gatk/mark_duplicates.py
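The 0x0400 flag mentioned above is a single bit in each read's SAM FLAG field, so downstream code can test it with a bitwise AND. The helper below is a standalone illustration, not part of the mark_duplicates task, and the two flag values are made-up examples.

```python
DUPLICATE_FLAG = 0x0400  # decimal 1024


def is_duplicate(flag: int) -> bool:
    """Return True when the SAM FLAG has the duplicate bit set."""
    return bool(flag & DUPLICATE_FLAG)


# 99 = paired (1) + proper pair (2) + mate reverse (32) + first in pair (64)
plain_read = is_duplicate(99)
# 1123 = 99 + 1024: the same read after being marked as a duplicate
marked_read = is_duplicate(1123)
```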
merge_bam_alignment task for Stargazer.¶
Merges aligned BAM with unmapped BAM using GATK MergeBamAlignment.
spec: docs/architecture/tasks.md
merge_bam_alignment(aligned_bam, unmapped_bam, ref)
async
¶
Merge alignment data from aligned BAM with data in unmapped BAM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| aligned_bam | Alignment | Aligned BAM asset from aligner | required |
| unmapped_bam | Alignment | Original unmapped BAM asset (must be queryname sorted) | required |
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with merged BAM file |
Source code in src/stargazer/tasks/gatk/merge_bam_alignment.py
sort_sam task for Stargazer.¶
Sorts BAM files using GATK SortSam.
spec: docs/architecture/tasks.md
sort_sam(alignment, sort_order='coordinate')
async
¶
Sort a SAM/BAM file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset to sort | required |
| sort_order | str | Sort order - one of "coordinate", "queryname", "duplicate" | 'coordinate' |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with sorted BAM file |
Source code in src/stargazer/tasks/gatk/sort_sam.py
VariantRecalibrator task for Stargazer.¶
Builds a recalibration model for VQSR using GATK VariantRecalibrator.
spec: docs/architecture/tasks.md
variant_recalibrator(vcf, ref, resources, mode='SNP')
async
¶
Build a VQSR recalibration model using GATK VariantRecalibrator.
Each KnownSites in resources must carry the following keyvalues:
resource_name: e.g. "hapmap", "omni", "1000G", "dbsnp", "mills"
known: "true" or "false"
training: "true" or "false"
truth: "true" or "false"
prior: numeric string, e.g. "15"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vcf | Variants | Raw genotyped VCF Variants asset | required |
| ref | Reference | Reference FASTA asset | required |
| resources | list[KnownSites] | Training/truth VCF resources for the recalibrator | required |
| mode | str | Variant type to recalibrate: "SNP" or "INDEL" | 'SNP' |
Returns:
| Type | Description |
|---|---|
| VQSRModel | VQSRModel asset (recal file) with tranches_path stored in keyvalues |
Source code in src/stargazer/tasks/gatk/variant_recalibrator.py
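The five required keyvalues correspond to the fields of GATK's tagged `--resource` argument. The sketch below shows how one resource could be turned into that argument; `resource_args` is a hypothetical helper and the file name is a placeholder, since this reference does not show how the task actually builds its command line.

```python
def resource_args(keyvalues: dict[str, str], vcf_path: str) -> list[str]:
    """Build the '--resource:<name>,known=...,prior=... <vcf>' argument pair."""
    tag = (
        "--resource:{resource_name},known={known},training={training},"
        "truth={truth},prior={prior}"
    ).format(**keyvalues)
    return [tag, vcf_path]


hapmap = {
    "resource_name": "hapmap",
    "known": "false",
    "training": "true",
    "truth": "true",
    "prior": "15",
}
args = resource_args(hapmap, "hapmap.vcf.gz")
```

A missing keyvalue raises KeyError here, which makes an incompletely tagged KnownSites asset fail early rather than produce a malformed argument.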
BWA tasks for reference genome indexing and alignment.¶
spec: docs/architecture/tasks.md
bwa_index(ref)
async
¶
Create BWA index files for a reference genome using bwa index.
Creates the following index files: .amb, .ann, .bwt, .pac, .sa
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| list[AlignerIndex] | List of AlignerIndex assets, one per index file |
Source code in src/stargazer/tasks/general/bwa.py
bwa_mem(ref, r1, r2=None, read_group=None)
async
¶
Align FASTQ reads to reference genome using BWA-MEM.
Produces an unsorted BAM file that typically needs to be sorted before downstream processing (e.g., with sort_sam).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
| r1 | R1 | R1 FASTQ read asset | required |
| r2 | R2 \| None | R2 FASTQ read asset (None for single-end) | None |
| read_group | dict[str, str] \| None | Optional read group override (ID, SM, LB, PL, PU) | None |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset containing the unsorted BAM file |
Source code in src/stargazer/tasks/general/bwa.py
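A read_group dict with the tags listed above can be rendered into the header string that `bwa mem -R` expects, where fields are separated by a literal backslash-t that bwa converts to tabs. `read_group_header` is a hypothetical helper; how the task itself formats or defaults the read group is not shown in this reference.

```python
def read_group_header(read_group: dict[str, str]) -> str:
    """Format read group tags as an '@RG' line for bwa's -R option."""
    order = ["ID", "SM", "LB", "PL", "PU"]
    parts = ["@RG"] + [f"{tag}:{read_group[tag]}" for tag in order if tag in read_group]
    # Join with a literal backslash followed by 't'; bwa expands it to a tab.
    return "\\t".join(parts)


header = read_group_header({"ID": "rg1", "SM": "NA12878", "PL": "ILLUMINA"})
```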
BWA-MEM2 tasks for reference genome indexing and alignment.¶
spec: docs/architecture/tasks.md
bwa_mem2_index(ref)
async
¶
Create BWA-MEM2 index files for a reference genome.
Creates the following index files: .amb, .ann, .bwt.2bit.64, .pac, .sa
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| list[AlignerIndex] | List of AlignerIndex assets, one per index file |
Source code in src/stargazer/tasks/general/bwa_mem2.py
bwa_mem2_mem(ref, r1, r2=None, read_group=None)
async
¶
Align FASTQ reads to reference genome using BWA-MEM2.
Produces an unsorted BAM file that typically needs to be sorted before downstream processing (e.g., with sort_sam).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
| r1 | R1 | R1 FASTQ read asset | required |
| r2 | R2 \| None | R2 FASTQ read asset (None for single-end) | None |
| read_group | dict[str, str] \| None | Optional read group override (ID, SM, LB, PL, PU) | None |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset containing the unsorted BAM file |
Source code in src/stargazer/tasks/general/bwa_mem2.py
Samtools tasks for reference genome indexing.¶
spec: docs/architecture/tasks.md
samtools_faidx(ref)
async
¶
Create a FASTA index (.fai file) using samtools faidx.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| ReferenceIndex | ReferenceIndex asset containing the .fai file |
Source code in src/stargazer/tasks/general/samtools.py
Leiden community detection clustering for scRNA-seq data.¶
spec: docs/workflows/scrna.md
cluster(adata, resolution=0.5, key_added='leiden')
async
¶
Assign cells to clusters using the Leiden algorithm.
Requires a precomputed neighbor graph (.uns["neighbors"]). Cluster labels are stored in .obs[key_added].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData asset with neighbor graph | required |
| resolution | float | Leiden resolution parameter (higher = more clusters) | 0.5 |
| key_added | str | .obs column name to store cluster labels | 'leiden' |
Returns:
| Type | Description |
|---|---|
| AnnData | AnnData asset with cluster labels in .obs |
Source code in src/stargazer/tasks/scrna/cluster.py
Marker gene identification via differential expression for scRNA-seq data.¶
spec: docs/workflows/scrna.md
find_markers(adata, groupby='leiden', method='wilcoxon')
async
¶
Identify marker genes for each cluster using differential expression.
Uses raw count data from .layers["counts"] for statistical testing. Results are stored in .uns["rank_genes_groups"].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Clustered AnnData asset with .layers["counts"] | required |
| groupby | str | .obs column to group cells by (cluster labels) | 'leiden' |
| method | str | Statistical test method ("wilcoxon", "t-test", etc.) | 'wilcoxon' |
Returns:
| Type | Description |
|---|---|
| AnnData | AnnData asset with ranked marker genes in .uns["rank_genes_groups"] |
Source code in src/stargazer/tasks/scrna/find_markers.py
Normalization and log transformation for scRNA-seq data.¶
spec: docs/workflows/scrna.md
normalize(adata)
async
¶
Normalize counts and apply log1p transformation.
Stores raw counts in .layers["counts"] before normalization so they are available for downstream differential expression analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | QC-filtered AnnData asset | required |
Returns:
| Type | Description |
|---|---|
| AnnData | Normalized and log-transformed AnnData asset |
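To make the order of operations concrete (copy raw counts aside, scale each cell, then apply log1p), here is a small standard-library sketch over a cells-by-genes list of lists. It is an illustration only: the function name normalize_log1p and the target_sum of 10,000 are assumptions, and the real task operates on an AnnData object rather than plain lists.

```python
import math


def normalize_log1p(counts, target_sum=10_000.0):
    # Keep an untouched copy of the raw counts, mirroring .layers["counts"].
    layers = {"counts": [row[:] for row in counts]}
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left at zero instead of dividing by zero.
        scale = target_sum / total if total else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized, layers


raw = [[1, 3], [0, 0]]
normalized, layers = normalize_log1p(raw)
```

Saving the copy before scaling is what lets a later differential expression step read the original counts.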
Source code in src/stargazer/tasks/scrna/normalize.py
Quality control and cell/gene filtering for scRNA-seq data.¶
spec: docs/workflows/scrna.md
qc_filter(adata, min_genes=100, min_cells=3, max_pct_mt=20.0, batch_key='', organism='human')
async
¶
Filter low-quality cells and genes from raw scRNA-seq data.
Applies standard QC filters: minimum gene/cell thresholds, mitochondrial gene percentage, and scrublet doublet detection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Raw AnnData asset (.h5ad) | required |
| min_genes | int | Minimum number of genes expressed per cell | 100 |
| min_cells | int | Minimum number of cells a gene must be expressed in | 3 |
| max_pct_mt | float | Maximum mitochondrial gene percentage allowed per cell | 20.0 |
| batch_key | str | Column in .obs to use as batch for scrublet (empty = no batch) | '' |
| organism | str | Organism name ("human" or "mouse") for gene prefix selection | 'human' |
Returns:
| Type | Description |
|---|---|
| AnnData | Filtered AnnData asset with QC metrics in .obs |
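The per-cell thresholds can be pictured with a short standard-library sketch. It assumes the common naming convention where human mitochondrial genes start with "MT-" and mouse ones with "mt-", and it treats the max_pct_mt limit as inclusive. Both points, along with the helper names pct_mt and keep_cell, are assumptions rather than details confirmed by the Stargazer source. Doublet detection is not covered here.

```python
# Assumed gene-name prefixes for mitochondrial genes, keyed by organism.
MT_PREFIX = {"human": "MT-", "mouse": "mt-"}


def pct_mt(gene_names, cell_counts, organism="human"):
    # Percentage of a cell's total counts that come from mitochondrial genes.
    prefix = MT_PREFIX[organism]
    total = sum(cell_counts)
    if total == 0:
        return 0.0
    mito = sum(c for g, c in zip(gene_names, cell_counts) if g.startswith(prefix))
    return 100.0 * mito / total


def keep_cell(gene_names, cell_counts, min_genes=100, max_pct_mt=20.0, organism="human"):
    # A cell passes when it expresses enough genes and is not dominated by
    # mitochondrial counts.
    n_genes = sum(1 for c in cell_counts if c > 0)
    return n_genes >= min_genes and pct_mt(gene_names, cell_counts, organism) <= max_pct_mt


genes = ["MT-CO1", "ACTB", "GAPDH"]
high_mito = keep_cell(genes, [5, 10, 5], min_genes=2)
low_mito = keep_cell(genes, [1, 10, 9], min_genes=2)
```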
Source code in src/stargazer/tasks/scrna/qc_filter.py
PCA, neighbor graph, and UMAP dimensionality reduction for scRNA-seq data.¶
spec: docs/workflows/scrna.md
reduce_dimensions(adata, n_pcs=50, n_neighbors=15)
async
¶
Compute PCA, k-nearest neighbor graph, and UMAP embedding.
Operates on the highly variable gene subset to reduce noise. Embeddings are stored in .obsm for downstream clustering and visualization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | AnnData asset with HVG annotations in .var | required |
| n_pcs | int | Number of principal components to compute | 50 |
| n_neighbors | int | Number of neighbors for the kNN graph | 15 |
Returns:
| Type | Description |
|---|---|
| AnnData | AnnData asset with PCA (.obsm["X_pca"]), neighbors, and UMAP (.obsm["X_umap"]) |
Source code in src/stargazer/tasks/scrna/reduce_dimensions.py
Highly variable gene selection for scRNA-seq data.¶
spec: docs/workflows/scrna.md
select_features(adata, n_top_genes=2000, batch_key='')
async
¶
Select highly variable genes for dimensionality reduction.
Annotates .var with highly_variable flags. Downstream tasks use only the highly variable subset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| adata | AnnData | Normalized AnnData asset | required |
| n_top_genes | int | Number of top highly variable genes to select | 2000 |
| batch_key | str | Column in .obs to use as batch (empty = no batch correction) | '' |
Returns:
| Type | Description |
|---|---|
| AnnData | AnnData asset with HVG annotations in .var |
Source code in src/stargazer/tasks/scrna/select_features.py
Utils¶
Local filesystem storage client for Stargazer.¶
Always the primary storage client. Stores files locally with TinyDB metadata indexing and delegates to a remote backend (PinataClient) or the public IPFS gateway for cache misses.
Also provides the module-level factory and singleton:
- get_client(): create a LocalStorageClient based on available config
- default_client: pre-built singleton used across the application
spec: docs/architecture/configuration.md
LocalStorageClient
¶
Local filesystem storage client with optional remote backend.
Always handles caching and TinyDB metadata. Downloads follow this order:
- Return if file already exists at component.path
- Check local cache by CID
- If remote backend (PinataClient) is attached, fetch via signed URL
- Fall back to public IPFS gateway
When a PinataClient remote is attached, upload/query/delete delegate to it. Without a remote, upload/query/delete operate locally only.
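The download order described above can be sketched as a small decision function. This is an illustration under stated assumptions: the function name resolve_source is hypothetical, and the cache is modeled as a directory holding one file per CID, which may not match the real on-disk layout.

```python
import tempfile
from pathlib import Path


def resolve_source(asset_path, cache_dir, cid, has_remote):
    # 1. The file is already present at the asset's own path.
    if asset_path is not None and asset_path.exists():
        return "existing"
    # 2. The local cache holds a copy (assumed layout: one file per CID).
    if (cache_dir / cid).exists():
        return "cache"
    # 3. An attached remote backend can fetch it via a signed URL.
    if has_remote:
        return "remote"
    # 4. Last resort: the public IPFS gateway.
    return "public_gateway"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    miss = resolve_source(None, cache, "example-cid", has_remote=False)
    (cache / "example-cid").write_bytes(b"data")
    hit = resolve_source(None, cache, "example-cid", has_remote=True)
```

Checking the cheap local sources first means the network is touched only on a genuine cache miss.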
Usage
client = LocalStorageClient()
comp = Asset(path=Path("data.bam"), keyvalues={"type": "alignment"})
await client.upload(comp)
files = await client.query({"type": "alignment"})
await client.download(comp)
Source code in src/stargazer/utils/local_storage.py
db
property
¶
Get TinyDB instance for local metadata storage (lazy initialized).
Re-opens if the DB file has been deleted or modified externally, keeping _last_id in sync when other processes write to the same file.
__init__(local_dir=None, remote=None, public_gateway=None)
¶
Initialize local storage client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| local_dir | Optional[Path] | Local directory for file storage (defaults to STARGAZER_LOCAL) | None |
| remote | Optional[PinataClient] | Optional PinataClient for authenticated Pinata operations | None |
| public_gateway | Optional[str] | Public IPFS gateway URL (defaults to PINATA_GATEWAY) | None |
Source code in src/stargazer/utils/local_storage.py
delete(component)
async
¶
Delete a file. Delegates to remote if attached, otherwise deletes locally.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set | required |
Source code in src/stargazer/utils/local_storage.py
download(component, dest=None)
async
¶
Download a file by CID. Checks cache, then remote, then public gateway.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set. When | required |
| dest | Optional[Path] | Optional destination path (copies file there) | None |
Returns:
| Type | Description |
|---|---|
| bool | True if the file was already cached, False if freshly downloaded. |
Source code in src/stargazer/utils/local_storage.py
query(keyvalues)
async
¶
Query files by keyvalue metadata. Delegates to remote if attached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| keyvalues | dict[str, str] | Metadata key-value pairs to filter by | required |
Returns:
| Type | Description |
|---|---|
| list[dict] | List of raw storage records with 'cid', 'path', and 'keyvalues' keys |
Source code in src/stargazer/utils/local_storage.py
update_metadata(cid, keyvalues, network=None)
async
¶
Merge a metadata patch onto an existing record by CID.
Delegates to the remote (Pinata PUT) if attached, otherwise merges into the local TinyDB record. Merge semantics match Pinata: supplied keys are added or overwritten, omitted keys preserved, bytes/CID untouched.
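The merge semantics can be shown with plain dicts. The sketch below is not the library's implementation: the in-memory records mapping and the helper name update_record are stand-ins invented for the example, and only the add-or-overwrite, preserve-omitted behavior is taken from the description above.

```python
def update_record(records, cid, patch):
    # Stand-in for a metadata store: maps CID -> record dict.
    if cid not in records:
        raise ValueError(f"no record for CID {cid}")
    record = records[cid]
    # Supplied keys are added or overwritten; omitted keys are preserved.
    record["keyvalues"] = {**record["keyvalues"], **patch}
    return record


records = {
    "example-cid": {
        "cid": "example-cid",
        "keyvalues": {"type": "alignment", "sample_id": "NA12878"},
    }
}
updated = update_record(records, "example-cid", {"sample_id": "NA12891", "sorted": "true"})
```

Because the patch never removes keys, clearing a value would need a different mechanism than this merge.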
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cid | str | Content identifier of the record to update | required |
| keyvalues | dict[str, str] | Metadata patch to merge | required |
| network | Optional[str] | "private"/"public" forwarded to the remote (ignored locally) | None |
Returns:
| Type | Description |
|---|---|
| dict | The updated record dict ( |
Raises:
| Type | Description |
|---|---|
| ValueError | when no record exists for |
Source code in src/stargazer/utils/local_storage.py
upload(component)
async
¶
Upload a file. Delegates to remote if attached, otherwise stores locally.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with path and keyvalues set | required |
Source code in src/stargazer/utils/local_storage.py
get_client()
¶
Create a storage client based on available credentials.
Always returns a LocalStorageClient. When PINATA_JWT is available, a PinataClient remote is attached for authenticated operations (upload, query, delete, private downloads). Public IPFS gateway access is always available for downloading public CIDs.
Resolution logic
- PINATA_JWT set -> LocalStorageClient + PinataClient remote
- No JWT -> LocalStorageClient (public gateway only)
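The resolution logic amounts to a single branch on whether a JWT is configured. A minimal sketch, assuming the value is read from an environment-style mapping and that an empty string counts as unset (neither detail is confirmed by the source), with describe_client as a hypothetical name:

```python
import os


def describe_client(env=None):
    # Fall back to the process environment when no mapping is supplied.
    env = os.environ if env is None else env
    # Assumption: an empty PINATA_JWT is treated the same as a missing one.
    if env.get("PINATA_JWT"):
        return "LocalStorageClient + PinataClient remote"
    return "LocalStorageClient (public gateway only)"


with_jwt = describe_client({"PINATA_JWT": "example-token"})
without_jwt = describe_client({})
```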
Returns:
| Type | Description |
|---|---|
| LocalStorageClient | A LocalStorageClient, optionally with a PinataClient remote |
Source code in src/stargazer/utils/local_storage.py
Pinata API v3 client for IPFS file storage.¶
Provides an async interface for authenticated Pinata operations:
- Uploading files with keyvalue metadata
- Downloading private files via signed gateway URLs
- Querying files by keyvalue pairs
- Deleting files
Used as a remote backend by LocalStorageClient when PINATA_JWT is available.
spec: docs/architecture/configuration.md
PinataClient
¶
Async client for Pinata API v3.
Handles authenticated operations against the Pinata API: uploads, private downloads via signed URLs, metadata queries, and deletions.
This is a pure remote transport — caching is handled by LocalStorageClient.
PINATA_VISIBILITY controls upload network and query/download behavior:
- "private": uploads as private, downloads via signed URLs, queries /files/private
- "public": uploads as public, downloads via public gateway (handled by LocalStorageClient), queries /files/public
If JWT is unset, only public downloads are possible (via LocalStorageClient's public gateway fallback).
Usage
client = PinataClient()
comp = Asset(path=Path("data.bam"), keyvalues={"type": "alignment"})
await client.upload(comp)  # sets comp.cid
files = await client.query({"type": "alignment", "sample": "NA12878"})
await client.delete(comp)
Source code in src/stargazer/utils/pinata.py
jwt
property
¶
Get JWT token, raising error if not set.
__init__(jwt=None, visibility=None)
¶
Initialize Pinata client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jwt | Optional[str] | Pinata JWT token (defaults to PINATA_JWT from config) | None |
| visibility | Optional[str] | "public" or "private" (defaults to PINATA_VISIBILITY from config) | None |
Source code in src/stargazer/utils/pinata.py
create_signed_upload_url(filename, keyvalues, network, expires=300, max_file_size=None)
async
¶
Mint a signed upload URL for a direct browser→Pinata upload.
All metadata is fixed at mint time — Pinata bakes filename,
keyvalues, and the size cap into the URL, so the uploader
supplies bytes only and can never attach unvalidated metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filename | str | Name the uploaded file will carry (downloads resolve their on-disk name from it) | required |
| keyvalues | dict[str, str] | Validated, already-stamped metadata to bake in | required |
| network | str | "private" or "public" | required |
| expires | int | URL lifetime in seconds after minting | 300 |
| max_file_size | Optional[int] | Upload size cap in bytes, if any | None |
Returns:
| Type | Description |
|---|---|
| str | The signed upload URL |
Source code in src/stargazer/utils/pinata.py
delete(component)
async
¶
Delete a file from Pinata by querying for its internal ID first.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set | required |
Source code in src/stargazer/utils/pinata.py
download_to(cid, dest)
async
¶
Download a file to dest. Uses signed URL for private, raises for public.
Public downloads are handled by LocalStorageClient's public gateway fallback, so this method is only called for private visibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cid | str | Content identifier | required |
| dest | Path | Destination path to write to | required |
Source code in src/stargazer/utils/pinata.py
query(keyvalues, network=None)
async
¶
Query files by keyvalue metadata from Pinata API.
By default checks both private and public Pinata endpoints and
merges results by CID so files are found regardless of which
network they were uploaded to; pass network to query a single
endpoint (the asset-manager page lists per-tab).
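Merging the two endpoint results by CID might look like the sketch below. It assumes each record is a dict with at least a cid key, and that when the same CID appears on both networks the private record is kept. That precedence is a guess made for the example, not something the description confirms. The name merge_by_cid is likewise hypothetical.

```python
def merge_by_cid(private_records, public_records):
    merged = {}
    for network, records in (("private", private_records), ("public", public_records)):
        for record in records:
            # setdefault keeps the first record seen for a CID, so private
            # results take precedence over public duplicates in this sketch.
            merged.setdefault(record["cid"], {**record, "network": network})
    return list(merged.values())


private = [{"cid": "cid-a", "name": "a.bam", "keyvalues": {"type": "alignment"}}]
public = [
    {"cid": "cid-a", "name": "a.bam", "keyvalues": {"type": "alignment"}},
    {"cid": "cid-b", "name": "b.bam", "keyvalues": {"type": "alignment"}},
]
merged_records = merge_by_cid(private, public)
```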
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| keyvalues | dict[str, str] | Metadata key-value pairs to filter by | required |
| network | Optional[str] | "private" or "public" to query one endpoint only | None |
Returns:
| Type | Description |
|---|---|
| list[dict] | List of matching file records with cid, name, keyvalues, and the network each record was found on |
Source code in src/stargazer/utils/pinata.py
update_metadata(cid, keyvalues, network=None)
async
¶
Merge keyvalues onto an existing file's metadata (Pinata PUT).
Pinata's update is a merge/upsert (verified empirically): the
supplied keys are added or overwritten, keys omitted from the patch
are preserved — there is no key removal. The file bytes and CID are
untouched (the CID is content-addressed; keyvalues live in the
account index, not the file), so existing *_cid provenance edges
pointing at this record stay valid. Looks up the internal file id by
CID first (Pinata keys updates off the UUID, not the CID), then PUTs.
_owner is restamped from STARGAZER_OWNER (env wins, unset
passes through) so a re-attributed edit stays consistent with upload.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cid | str | Content identifier of the file to update | required |
| keyvalues | dict[str, str] | Metadata patch to merge (partial: only these keys change). Reserved | required |
| network | Optional[str] | "private" or "public"; defaults to the client visibility | None |
Returns:
| Type | Description |
|---|---|
| dict | The updated record: |
Raises:
| Type | Description |
|---|---|
| ValueError | when no file exists on |
Source code in src/stargazer/utils/pinata.py
upload(component)
async
¶
Upload a file to IPFS via Pinata. Sets component.cid.
Files up to TUS_THRESHOLD_BYTES go via the plain multipart POST;
larger files use the resumable TUS endpoint (chunked, no resume yet).
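The size-based routing can be sketched in a few lines. The threshold value below is a placeholder chosen for illustration; the real TUS_THRESHOLD_BYTES constant lives in the Stargazer source and its value is not stated here. The function name choose_upload_route is hypothetical.

```python
# Placeholder value for illustration only; the real constant may differ.
TUS_THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_upload_route(size_bytes, threshold=TUS_THRESHOLD_BYTES):
    # Files up to and including the threshold use the plain multipart POST;
    # anything larger goes through the chunked TUS endpoint.
    return "multipart" if size_bytes <= threshold else "tus"


small = choose_upload_route(10 * 1024)
boundary = choose_upload_route(TUS_THRESHOLD_BYTES)
large = choose_upload_route(TUS_THRESHOLD_BYTES + 1)
```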
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with path and keyvalues set | required |
Source code in src/stargazer/utils/pinata.py
Query generation utilities for Stargazer.¶
Utilities for generating metadata queries, including support for cartesian product queries across multiple dimensions.
spec: docs/architecture/types.md
generate_query_combinations(base_query, filters)
¶
Generate query combinations from filters using cartesian product.
Takes a base query dict and filters dict, where filters can contain scalar values or lists. For any list-valued filter, generates all combinations using cartesian product, while preserving scalar filters and the base query in all combinations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_query | Dict[str, Any] | Base query dict to include in all combinations | required |
| filters | Dict[str, Any] | Filter dict with scalar or list values | required |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of query dicts representing all combinations |
Example
base = {"type": "reference"}
filters = {"build": "GRCh38", "tool": ["fasta", "bwa"]}
generate_query_combinations(base, filters)
[
    {"type": "reference", "build": "GRCh38", "tool": "fasta"},
    {"type": "reference", "build": "GRCh38", "tool": "bwa"}
]

base = {"type": "reference"}
filters = {"build": ["GRCh38", "GRCh37"], "tool": ["fasta", "bwa"]}
generate_query_combinations(base, filters)
[
    {"type": "reference", "build": "GRCh38", "tool": "fasta"},
    {"type": "reference", "build": "GRCh38", "tool": "bwa"},
    {"type": "reference", "build": "GRCh37", "tool": "fasta"},
    {"type": "reference", "build": "GRCh37", "tool": "bwa"}
]
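For readers who want to see how the cartesian expansion could work, here is a sketch built on itertools.product. It is one way to write a function with the documented behavior, not a copy of the code in src/stargazer/utils/query.py. The name query_combinations_sketch is deliberately different to make that clear.

```python
from itertools import product
from typing import Any


def query_combinations_sketch(
    base_query: dict[str, Any], filters: dict[str, Any]
) -> list[dict[str, Any]]:
    # Scalar filters are copied unchanged into every combination.
    scalars = {k: v for k, v in filters.items() if not isinstance(v, list)}
    # List-valued filters are the dimensions of the cartesian product.
    list_keys = [k for k, v in filters.items() if isinstance(v, list)]
    list_values = [filters[k] for k in list_keys]
    combos = []
    # With no list filters, product() yields a single empty tuple, so the
    # result is one query made of the base query plus the scalar filters.
    for values in product(*list_values):
        combos.append({**base_query, **scalars, **dict(zip(list_keys, values))})
    return combos


combos = query_combinations_sketch(
    {"type": "reference"},
    {"build": ["GRCh38", "GRCh37"], "tool": ["fasta", "bwa"]},
)
```

itertools.product varies the last dimension fastest, which matches the ordering of the results listed in the example above.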
Source code in src/stargazer/utils/query.py
Workflows¶
GATK Best Practices: Data Pre-processing for Variant Discovery¶
Implements:
1. Reference preparation: FASTA index, sequence dictionary, BWA index
2. Sample preprocessing: align, sort, mark duplicates, BQSR
spec: docs/architecture/workflows.md
prepare_reference(build)
async
¶
Prepare reference genome for alignment and variant calling.
Assembles the reference FASTA from storage and creates necessary indices:
1. FASTA index (samtools faidx)
2. BWA index (bwa index)
All indices are uploaded to storage as side-effects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| build | str | Reference genome build identifier (e.g. "GRCh38") | required |
Returns:
| Type | Description |
|---|---|
| Reference | Reference asset (FASTA file) |
Source code in src/stargazer/workflows/gatk_data_preprocessing.py
preprocess_sample(build, sample_id)
async
¶
Pre-process a single sample's reads for variant calling.
Assembles reference and reads from storage, then runs:
1. BWA-MEM alignment
2. Coordinate sort (GATK SortSam)
3. Mark duplicates (GATK MarkDuplicates)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| build | str | Reference genome build identifier | required |
| sample_id | str | Sample identifier used to query reads | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with the preprocessed BAM file |
Source code in src/stargazer/workflows/gatk_data_preprocessing.py
GATK Best Practices: Germline Short Variant Discovery (SNPs + Indels)¶
End-to-end GATK pipeline from raw reads to joint-genotyped variants:
- prepare_reference — FASTA index, sequence dictionary, BWA index
- preprocess_sample — align, sort, mark duplicates (per sample, parallel)
- haplotype_caller — per-sample GVCF (parallel)
- joint_call_gvcfs — GenomicsDBImport + GenotypeGVCFs
spec: docs/architecture/workflows.md
germline_short_variant_discovery(build, sample_ids, cohort_id='cohort')
async
¶
End-to-end germline short variant discovery from raw reads.
Runs the full GATK best-practices pipeline:
1. Reference preparation (indexing)
2. Per-sample preprocessing (align, sort, mark duplicates) in parallel
3. HaplotypeCaller per sample in parallel
4. Joint genotyping (GenomicsDBImport + GenotypeGVCFs)
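The fan-out and fan-in shape of these steps can be sketched with asyncio.gather. The four coroutines below are stubs that return strings. They borrow the step names from this page, but their signatures and return values are invented for the example and do not reflect the real Stargazer tasks or how Flyte schedules them.

```python
import asyncio


# Stub tasks: stand-ins that only return labels (signatures are assumptions).
async def prepare_reference(build):
    return f"ref:{build}"


async def preprocess_sample(build, sample_id):
    return f"bam:{sample_id}"


async def haplotype_caller(ref, bam):
    return bam.replace("bam:", "gvcf:")


async def joint_call_gvcfs(ref, gvcfs, cohort_id):
    return f"{cohort_id}:" + ",".join(gvcfs)


async def pipeline_sketch(build, sample_ids, cohort_id="cohort"):
    # Step 1 runs once, before any per-sample work.
    ref = await prepare_reference(build)
    # Steps 2 and 3 fan out: one coroutine per sample, awaited together.
    bams = await asyncio.gather(*(preprocess_sample(build, s) for s in sample_ids))
    gvcfs = await asyncio.gather(*(haplotype_caller(ref, b) for b in bams))
    # Step 4 fans back in to a single joint-genotyping call.
    return await joint_call_gvcfs(ref, list(gvcfs), cohort_id)


result = asyncio.run(pipeline_sketch("GRCh38", ["s1", "s2"]))
```

asyncio.gather returns results in the order the awaitables were passed, so per-sample outputs stay aligned with sample_ids.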
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| build | str | Reference genome build identifier (e.g. "GRCh38") | required |
| sample_ids | list[str] | List of sample identifiers to process | required |
| cohort_id | str | Identifier for the cohort output (default: "cohort") | 'cohort' |
Returns:
| Type | Description |
|---|---|
| Variants | Joint-genotyped Variants asset |
Source code in src/stargazer/workflows/germline_short_variant_discovery.py
scRNA-seq clustering pipeline: QC → normalization → clustering → marker genes.¶
Implements the scanpy clustering tutorial workflow as Flyte v2 tasks. Assembles a raw AnnData from storage by sample_id, then runs the full preprocessing and clustering stack.
Prerequisites
A raw .h5ad file must be uploaded to storage with asset="anndata" and stage="raw".
spec: docs/workflows/scrna.md
scrna_clustering_pipeline(sample_id, organism='human', n_top_genes=2000, resolution=0.5, max_pct_mt=20.0)
async
¶
End-to-end scRNA-seq clustering pipeline.
Runs QC filtering, normalization, feature selection, dimensionality reduction, Leiden clustering, and marker gene identification in sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sample_id | str | Sample identifier used to look up the raw AnnData in storage | required |
| organism | str | Organism name (e.g. "human", "mouse") | 'human' |
| n_top_genes | int | Number of highly variable genes to select | 2000 |
| resolution | float | Leiden clustering resolution (higher = more clusters) | 0.5 |
| max_pct_mt | float | Maximum mitochondrial gene percentage per cell | 20.0 |
Returns:
| Type | Description |
|---|---|
| AnnData | Annotated AnnData asset with cluster labels and ranked marker genes |