API Reference¶
Config¶
Marshal¶
Output marshaling: typed object → dict (for MCP response serialization).¶
spec: docs/architecture/mcp-server.md
marshal_output(value)
¶
Convert a typed Python object to a JSON-friendly structure for MCP transport.
Source code in src/stargazer/marshal.py
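As a rough illustration of what "typed object to JSON-friendly structure" can mean, the sketch below walks a value recursively and converts dataclasses and paths to plain types. `marshal_sketch` and `DemoAsset` are hypothetical names for this example only; the real `marshal_output` may handle more types or use a different strategy.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any, Optional


def marshal_sketch(value: Any) -> Any:
    """Recursively convert a typed value into JSON-friendly primitives."""
    if is_dataclass(value) and not isinstance(value, type):
        # asdict() expands nested dataclasses; the values still need a pass
        return {key: marshal_sketch(item) for key, item in asdict(value).items()}
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, dict):
        return {str(key): marshal_sketch(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [marshal_sketch(item) for item in value]
    return value


@dataclass
class DemoAsset:  # hypothetical stand-in, not the real Asset class
    cid: str
    path: Optional[Path]
    keyvalues: dict


payload = marshal_sketch(DemoAsset("bafy-demo", Path("x.bam"), {"asset": "alignment"}))
encoded = json.dumps(payload)  # succeeds because only plain types remain
```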
Registry¶
Task registry for auto-discovery of Flyte tasks and workflows.
Discovers all tasks from stargazer.tasks and stargazer.workflows modules, extracts parameter types, defaults, and return types for MCP catalog exposure.
spec: docs/architecture/mcp-server.md
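The parameter extraction described above can be approximated with the standard library. This is a minimal sketch for plain Python callables; `describe_callable` and `demo_sort` are hypothetical, and the real registry works with Flyte task objects, which may expose their signatures differently.

```python
import inspect
from typing import Any, get_type_hints


def type_name(hint: Any) -> str:
    """Readable name for a type hint, falling back to its string form."""
    return getattr(hint, "__name__", str(hint))


def describe_callable(fn) -> dict:
    """Collect name, first docstring line, parameters, and return type."""
    hints = get_type_hints(fn)
    params = []
    for name, param in inspect.signature(fn).parameters.items():
        required = param.default is inspect.Parameter.empty
        params.append({
            "name": name,
            "type": type_name(hints.get(name)),
            "required": required,
            "default": None if required else param.default,
        })
    return {
        "name": fn.__name__,
        "description": (inspect.getdoc(fn) or "").split("\n")[0],
        "params": params,
        "returns": type_name(hints.get("return")),
    }


async def demo_sort(alignment: str, sort_order: str = "coordinate") -> str:
    """Sort a BAM file (hypothetical demo task)."""
    return alignment


info = describe_callable(demo_sort)
```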
TaskInfo
dataclass
¶
Complete metadata about a registered task.
Source code in src/stargazer/registry.py
TaskOutput
dataclass
¶
TaskParam
dataclass
¶
TaskRegistry
dataclass
¶
Discovers and provides access to all Flyte tasks and workflows.
Source code in src/stargazer/registry.py
__post_init__()
¶
get(name)
¶
list_tasks(category=None)
¶
List all registered tasks, optionally filtered by category.
Source code in src/stargazer/registry.py
to_catalog(category=None)
¶
Return a JSON-serializable catalog of all tasks.
Source code in src/stargazer/registry.py
Server¶
Stargazer MCP Server.¶
Exposes storage tools and a dynamic task runner via FastMCP. Tasks and workflows are auto-discovered from the registry and executed through the Flyte local run context.
Usage
stargazer          # stdio transport (default)
stargazer --http   # streamable-http transport
spec: docs/architecture/mcp-server.md
delete_file(cid)
async
¶
download_file(cid)
async
¶
Download a file by CID to local cache. Returns the local path.
list_tasks(category=None)
¶
List available tasks and workflows with their parameter signatures.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str \| None | Filter by "task" or "workflow". Omit for all. | None |
Returns:
| Type | Description |
|---|---|
| list[dict] | Catalog of tasks with name, category, description, params, and outputs. |
Source code in src/stargazer/server.py
main()
¶
query_files(keyvalues)
async
¶
Query files by metadata key-value pairs. Returns matching files.
run_task(task_name, filters, inputs=None)
async
¶
Run a single task by name for ad-hoc experimentation.
Use this for testing individual tools in isolation. Asset parameters are assembled from storage using the provided filters — one call to assemble() resolves all required assets. Scalar and Path parameters are passed separately via inputs.
For reproducible pipeline runs, use run_workflow instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_name | str | Name of the task (from list_tasks with category="task"). | required |
| filters | dict | Keyvalue filters for assemble() to resolve asset parameters (e.g. {"build": "GRCh38", "sample_id": "NA12878"}). | required |
| inputs | dict \| None | Optional scalar/Path keyword arguments (str, int, bool, list[str]). | None |
Returns:
| Type | Description |
|---|---|
| dict | Serialized task output. Single outputs are returned directly, multi-outputs as {"o0": ..., "o1": ...}. |
Source code in src/stargazer/server.py
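The argument and return shapes described for run_task can be sketched as follows. `pack_outputs` is a hypothetical helper, and treating multiple outputs as a Python tuple is an assumption of this sketch rather than something the reference states; the example arguments reuse the filter values shown in the parameter table.

```python
from typing import Any


def pack_outputs(result: Any) -> Any:
    """Return single outputs as-is; key multiple outputs as o0, o1, ..."""
    if isinstance(result, tuple):  # assumption: multi-output tasks return a tuple
        return {f"o{index}": item for index, item in enumerate(result)}
    return result


# Shape of a run_task call: assets come from filters, scalars from inputs.
run_task_args = {
    "task_name": "bwa_mem",
    "filters": {"build": "GRCh38", "sample_id": "NA12878"},
    "inputs": None,
}

single = pack_outputs({"cid": "bafy-demo"})
multi = pack_outputs(({"cid": "a"}, {"cid": "b"}))
```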
run_workflow(workflow_name, inputs)
async
¶
Run a workflow by name for reproducible pipeline execution.
Workflows accept scalar parameters (str, int, bool, list[str]) and handle their own asset assembly internally. Pass inputs exactly as the workflow signature defines them — no automatic resolution is performed.
For ad-hoc experimentation with individual tools, use run_task instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| workflow_name | str | Name of the workflow (from list_tasks with category="workflow"). | required |
| inputs | dict | Keyword arguments as a JSON dict (scalars only). | required |
Returns:
| Type | Description |
|---|---|
| dict | Serialized workflow output. Single outputs are returned directly, multi-outputs as {"o0": ..., "o1": ...}. |
Source code in src/stargazer/server.py
show_config()
async
¶
Show current Stargazer configuration and available task counts.
Source code in src/stargazer/server.py
upload_file(path, keyvalues)
async
¶
Upload a file with metadata key-value pairs.
keyvalues must include "asset". Valid asset keys are derived from the Asset registry (e.g. asset=reference component=fasta).
When displaying results, always show a table with the CID and all keyvalues.
Source code in src/stargazer/server.py
Tasks¶
apply_bqsr task for Stargazer.¶
Applies BQSR recalibration to BAM files using GATK ApplyBQSR.
spec: docs/architecture/tasks.md
apply_bqsr(alignment, ref, bqsr_report)
async
¶
Apply Base Quality Score Recalibration to a BAM file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset | required |
| ref | Reference | Reference FASTA asset | required |
| bqsr_report | BQSRReport | Recalibration table from base_recalibrator | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with recalibrated BAM file |
Source code in src/stargazer/tasks/gatk/apply_bqsr.py
ApplyVQSR task for Stargazer.¶
Applies VQSR recalibration to a VCF using GATK ApplyVQSR.
spec: docs/architecture/tasks.md
apply_vqsr(vcf, ref, vqsr_model, truth_sensitivity_filter_level=None)
async
¶
Apply VQSR recalibration to a VCF using GATK ApplyVQSR.
The recalibration mode (SNP or INDEL) is read from vqsr_model.keyvalues["mode"]. If truth_sensitivity_filter_level is not provided, defaults to 99.5 for SNP and 99.0 for INDEL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vcf | Variants | Raw (or SNP-filtered) VCF Variants asset | required |
| ref | Reference | Reference FASTA asset | required |
| vqsr_model | VQSRModel | Recalibration model from variant_recalibrator | required |
| truth_sensitivity_filter_level | float \| None | Truth sensitivity level (percent) that sets the VQSLOD filter threshold (optional) | None |
Returns:
| Type | Description |
|---|---|
| Variants | Variants asset with VQSR-filtered VCF |
Source code in src/stargazer/tasks/gatk/apply_vqsr.py
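The mode-dependent default described above amounts to a small lookup. The sketch below uses the two documented defaults; `resolve_filter_level` is a hypothetical name, and raising ValueError for an unknown mode is an assumption, since the reference does not say how the task reacts in that case.

```python
from typing import Optional

# Documented defaults: 99.5 for SNP, 99.0 for INDEL.
DEFAULT_FILTER_LEVELS = {"SNP": 99.5, "INDEL": 99.0}


def resolve_filter_level(mode: str, level: Optional[float] = None) -> float:
    """Use the caller's level when given, otherwise the default for the mode."""
    if level is not None:
        return level
    if mode not in DEFAULT_FILTER_LEVELS:
        # assumption: unknown modes are rejected rather than silently defaulted
        raise ValueError(f"unsupported VQSR mode: {mode!r}")
    return DEFAULT_FILTER_LEVELS[mode]
```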
base_recalibrator task for Stargazer.¶
Creates BQSR recalibration table using GATK BaseRecalibrator.
spec: docs/architecture/tasks.md
base_recalibrator(alignment, ref, known_sites)
async
¶
Generate a Base Quality Score Recalibration report.
Uses GATK BaseRecalibrator to analyze patterns of covariation in the sequence dataset and produce a recalibration table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset (should be sorted and have duplicates marked) | required |
| ref | Reference | Reference FASTA asset | required |
| known_sites | list[KnownSites] | List of KnownSites VCF assets (dbSNP, known indels, etc.) | required |
Returns:
| Type | Description |
|---|---|
| BQSRReport | BQSRReport asset containing the recalibration table |
Source code in src/stargazer/tasks/gatk/base_recalibrator.py
CombineGVCFs task for Stargazer.¶
Combines multiple per-sample GVCFs into a single multi-sample GVCF for joint genotyping using GATK CombineGVCFs.
spec: docs/architecture/tasks.md
combine_gvcfs(gvcfs, ref, cohort_id='cohort')
async
¶
Combine multiple per-sample GVCFs into a single multi-sample GVCF.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gvcfs | list[Variants] | List of Variants assets, each containing a GVCF from a single sample | required |
| ref | Reference | Reference FASTA asset | required |
| cohort_id | str | Identifier for the combined cohort (default: "cohort") | 'cohort' |
Returns:
| Type | Description |
|---|---|
| Variants | Variants asset with combined multi-sample GVCF |
Source code in src/stargazer/tasks/gatk/combine_gvcfs.py
GATK CreateSequenceDictionary task for reference genome.¶
spec: docs/architecture/tasks.md
create_sequence_dictionary(ref)
async
¶
Create a sequence dictionary (.dict file) using GATK CreateSequenceDictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| SequenceDict | SequenceDict asset containing the .dict file |
Source code in src/stargazer/tasks/gatk/create_sequence_dictionary.py
GenomicsDBImport task for Stargazer.¶
Import VCFs to GenomicsDB for efficient joint genotyping of large cohorts.
spec: docs/architecture/tasks.md
genomics_db_import(gvcfs, workspace_path, intervals=None)
async
¶
Import GVCFs to GenomicsDB workspace for scalable joint genotyping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gvcfs | list[Variants] | List of per-sample GVCF Variants assets to import | required |
| workspace_path | Path | Path where GenomicsDB workspace will be created | required |
| intervals | list[str] \| None | Genomic intervals to process (e.g., ["chr1", "chr2:100000-200000"]) | None |
Returns:
| Type | Description |
|---|---|
| Path | Path to the created GenomicsDB workspace directory |
Source code in src/stargazer/tasks/gatk/genomics_db_import.py
haplotype_caller task for Stargazer.¶
Calls germline SNPs and indels via local re-assembly of haplotypes using GATK HaplotypeCaller in GVCF mode.
spec: docs/architecture/tasks.md
haplotype_caller(alignment, ref)
async
¶
Call germline variants in GVCF mode using GATK HaplotypeCaller.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Sorted, duplicate-marked BAM asset (BQSR-recalibrated recommended) | required |
| ref | Reference | Reference FASTA asset with sequence dictionary | required |
Returns:
| Type | Description |
|---|---|
| Variants | Variants asset containing the per-sample GVCF |
Source code in src/stargazer/tasks/gatk/haplotype_caller.py
joint_call_gvcfs task for Stargazer.¶
Consolidates per-sample GVCFs into a GenomicsDB datastore and performs joint genotyping in a single task, avoiding the need to persist the GenomicsDB workspace between tasks.
spec: docs/architecture/tasks.md
joint_call_gvcfs(gvcfs, ref, intervals, cohort_id='cohort')
async
¶
Consolidate GVCFs into GenomicsDB and joint-genotype in a single task.
Runs GenomicsDBImport followed immediately by GenotypeGVCFs within the same execution context, so the workspace never needs to leave the pod.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gvcfs | list[Variants] | Per-sample GVCF Variants assets from HaplotypeCaller | required |
| ref | Reference | Reference FASTA asset | required |
| intervals | list[str] | Genomic intervals to process (required by GenomicsDBImport) | required |
| cohort_id | str | Sample ID label for the output VCF (default: "cohort") | 'cohort' |
Returns:
| Type | Description |
|---|---|
| Variants | Joint-genotyped Variants asset (VCF) |
Source code in src/stargazer/tasks/gatk/joint_call_gvcfs.py
mark_duplicates task for Stargazer.¶
Marks duplicate reads in BAM files using GATK MarkDuplicates.
spec: docs/architecture/tasks.md
mark_duplicates(alignment)
async
¶
Mark duplicate reads in a BAM file.
Uses GATK MarkDuplicates to identify and tag duplicate reads that originated from the same DNA fragment (PCR or optical duplicates). Duplicates are marked with the 0x0400 SAM flag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset (should be coordinate sorted) | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with duplicates marked |
Source code in src/stargazer/tasks/gatk/mark_duplicates.py
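Because duplicates are flagged with the 0x0400 bit rather than removed, downstream code can test for them with a bit mask. A minimal sketch follows; `is_duplicate` is a hypothetical helper and operates on the integer FLAG field of a SAM/BAM record.

```python
DUPLICATE_FLAG = 0x0400  # decimal 1024, the SAM "PCR or optical duplicate" bit


def is_duplicate(flag: int) -> bool:
    """True when the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)
```

For instance, a FLAG of 163 (paired, proper pair, mate reverse, second in pair) has the bit clear, while 163 + 1024 = 1187 has it set.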
merge_bam_alignment task for Stargazer.¶
Merges aligned BAM with unmapped BAM using GATK MergeBamAlignment.
spec: docs/architecture/tasks.md
merge_bam_alignment(aligned_bam, unmapped_bam, ref)
async
¶
Merge alignment data from aligned BAM with data in unmapped BAM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| aligned_bam | Alignment | Aligned BAM asset from aligner | required |
| unmapped_bam | Alignment | Original unmapped BAM asset (must be queryname sorted) | required |
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with merged BAM file |
Source code in src/stargazer/tasks/gatk/merge_bam_alignment.py
sort_sam task for Stargazer.¶
Sorts BAM files using GATK SortSam.
spec: docs/architecture/tasks.md
sort_sam(alignment, sort_order='coordinate')
async
¶
Sort a SAM/BAM file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| alignment | Alignment | Input BAM asset to sort | required |
| sort_order | str | Sort order, one of "coordinate", "queryname", "duplicate" | 'coordinate' |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with sorted BAM file |
Source code in src/stargazer/tasks/gatk/sort_sam.py
VariantRecalibrator task for Stargazer.¶
Builds a recalibration model for VQSR using GATK VariantRecalibrator.
spec: docs/architecture/tasks.md
variant_recalibrator(vcf, ref, resources, mode='SNP')
async
¶
Build a VQSR recalibration model using GATK VariantRecalibrator.
Each KnownSites in resources must carry the following keyvalues:
- resource_name: e.g. "hapmap", "omni", "1000G", "dbsnp", "mills"
- known: "true" or "false"
- training: "true" or "false"
- truth: "true" or "false"
- prior: numeric string, e.g. "15"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vcf | Variants | Raw genotyped VCF Variants asset | required |
| ref | Reference | Reference FASTA asset | required |
| resources | list[KnownSites] | Training/truth VCF resources for the recalibrator | required |
| mode | str | Variant type to recalibrate: "SNP" or "INDEL" | 'SNP' |
Returns:
| Type | Description |
|---|---|
| VQSRModel | VQSRModel asset (recal file) with tranches_path stored in keyvalues |
Source code in src/stargazer/tasks/gatk/variant_recalibrator.py
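The required keyvalues map naturally onto GATK's tagged `--resource` argument. The sketch below shows one way that mapping could look; `resource_argument` is a hypothetical helper and the file name is illustrative, so the task's actual command construction may differ.

```python
def resource_argument(keyvalues: dict, vcf_path: str) -> list:
    """Build the tagged --resource argument pair for one training resource."""
    tag = (
        "--resource:{resource_name},known={known},"
        "training={training},truth={truth},prior={prior}"
    ).format(**keyvalues)  # extra keyvalues such as build are ignored
    return [tag, vcf_path]


hapmap = {
    "resource_name": "hapmap",
    "known": "false",
    "training": "true",
    "truth": "true",
    "prior": "15",
    "build": "GRCh38",
}
args = resource_argument(hapmap, "hapmap.vcf.gz")  # illustrative file name
```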
BWA tasks for reference genome indexing and alignment.¶
spec: docs/architecture/tasks.md
bwa_index(ref)
async
¶
Create BWA index files for a reference genome using bwa index.
Creates the following index files: .amb, .ann, .bwt, .pac, .sa
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| list[AlignerIndex] | List of AlignerIndex assets, one per index file |
Source code in src/stargazer/tasks/general/bwa.py
bwa_mem(ref, r1, r2=None, read_group=None)
async
¶
Align FASTQ reads to reference genome using BWA-MEM.
Produces an unsorted BAM file that typically needs to be sorted before downstream processing (e.g., with sort_sam).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
| r1 | R1 | R1 FASTQ read asset | required |
| r2 | R2 \| None | R2 FASTQ read asset (None for single-end) | None |
| read_group | dict[str, str] \| None | Optional read group override (ID, SM, LB, PL, PU) | None |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset containing the unsorted BAM file |
Source code in src/stargazer/tasks/general/bwa.py
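A read_group dict such as the one accepted by bwa_mem is typically turned into the `@RG` header string that `bwa mem -R` expects, with fields separated by a literal backslash-t sequence. The helper below is a hypothetical sketch of that conversion, not necessarily how the task builds the string.

```python
def read_group_header(read_group: dict) -> str:
    """Format a read group dict as an @RG line for bwa mem -R.

    bwa expects the two-character sequence backslash + t between fields
    and converts it to a tab itself, so no real tab characters are emitted.
    """
    fields = "".join(f"\\t{key}:{value}" for key, value in read_group.items())
    return "@RG" + fields


header = read_group_header({"ID": "NA12878", "SM": "NA12878", "PL": "ILLUMINA"})
```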
Samtools tasks for reference genome indexing.¶
spec: docs/architecture/tasks.md
samtools_faidx(ref)
async
¶
Create a FASTA index (.fai file) using samtools faidx.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ref | Reference | Reference FASTA asset | required |
Returns:
| Type | Description |
|---|---|
| ReferenceIndex | ReferenceIndex asset containing the .fai file |
Source code in src/stargazer/tasks/general/samtools.py
Types¶
Alignment asset types for Stargazer.¶
spec: docs/architecture/types.md
Alignment
dataclass
¶
Bases: Asset
BAM/CRAM alignment file asset.
Carries reference_cid and r1_cid for provenance (PROV entity derivation).
Source code in src/stargazer/types/alignment.py
AlignmentIndex
dataclass
¶
Bases: Asset
BAI/CRAI alignment index file asset.
Carries alignment_cid linking to the Alignment it indexes.
Source code in src/stargazer/types/alignment.py
BQSRReport
dataclass
¶
Bases: Asset
BQSR recalibration table produced by GATK BaseRecalibrator.
Carries alignment_cid linking to the Alignment it was produced from.
Source code in src/stargazer/types/alignment.py
Asset base dataclass for Stargazer.¶
spec: docs/architecture/types.md
Asset
dataclass
¶
Base class for all typed file assets in Stargazer.
Attributes:
| Name | Type | Description |
|---|---|---|
| cid | str | Content identifier (CID) for the stored file |
| path | Path \| None | Local filesystem path (set after download or upload) |
| keyvalues | dict[str, str] | Metadata key-value pairs for querying and routing |
Subclasses declare typed field annotations directly:
@dataclass
class Alignment(Asset):
    _asset_key: ClassVar[str] = "alignment"
    sample_id: str = ""
    duplicates_marked: bool = False
__init_subclass__ auto-derives _field_types (non-str fields) and
_field_defaults (all defaults) from the annotations. The _field_types
and _field_defaults ClassVars on the base class are empty-dict defaults
inherited by subclasses that declare no fields.
Source code in src/stargazer/types/asset.py
__getattr__(name)
¶
Fall back to keyvalues lookup for undeclared attributes on base Asset.
__getattribute__(name)
¶
Read declared fields from keyvalues with type coercion; delegate everything else.
Source code in src/stargazer/types/asset.py
__init_subclass__(**kwargs)
¶
Register subclass in the asset registry and derive field metadata.
Source code in src/stargazer/types/asset.py
__post_init__()
¶
__setattr__(name, value)
¶
Coerce and store declared fields into keyvalues; bypass for core attrs.
Source code in src/stargazer/types/asset.py
fetch()
async
¶
Download this asset and all its companions from storage.
Downloads the asset itself, then queries storage for any assets linked
via {_asset_key}_cid to auto-download companions (e.g. indices,
mate reads).
Source code in src/stargazer/types/asset.py
from_dict(data)
classmethod
¶
Reconstruct from a serialized dict.
Source code in src/stargazer/types/asset.py
to_dict()
¶
update(path, **kwargs)
async
¶
Upload file and set cid. Shared by all asset types.
Source code in src/stargazer/types/asset.py
assemble(**filters)
async
¶
Query storage by keyvalue filters and return specialized assets.
The asset filter key accepts a string or list of strings to narrow
by asset type. Other filters are passed through as keyvalue matchers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **filters | Any | Keyvalue filters. Values may be scalars or lists (cartesian product). | {} |
Returns:
| Type | Description |
|---|---|
| list[Asset] | Flat list of specialized Asset subclass instances. |
Examples:
assets = await assemble(build="GRCh38", asset="reference")
ref = next(a for a in assets if isinstance(a, Reference))

assets = await assemble(sample_id="NA12878", asset=["r1", "r2"])
r1 = next(a for a in assets if isinstance(a, R1))
Source code in src/stargazer/types/asset.py
Read file asset types for Stargazer.¶
spec: docs/architecture/types.md
R1
dataclass
¶
Bases: Asset
R1 (forward) FASTQ read file asset.
Carries mate_cid pointing to the paired R2 asset's CID (None for single-end).
Source code in src/stargazer/types/reads.py
Reference genome asset types for Stargazer.¶
spec: docs/architecture/types.md
AlignerIndex
dataclass
¶
Bases: Asset
Aligner index file asset (one asset per file for multi-file indices).
Source code in src/stargazer/types/reference.py
Reference
dataclass
¶
Bases: Asset
Reference FASTA file asset.
Source code in src/stargazer/types/reference.py
contigs
property
¶
Read contig names from the companion .fai index.
Requires fetch() to have been called first so the ReferenceIndex companion is downloaded alongside this reference.
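Reading contig names from a .fai index comes down to taking the first tab-separated column of each line (the columns are name, length, offset, line bases, line width). The sketch below illustrates this on an in-memory string; `contig_names` is a hypothetical helper, the offsets in the sample text are made up, and the actual property may read the file differently.

```python
def contig_names(fai_text: str) -> list:
    """Return contig names from the text of a .fai index, in file order."""
    return [
        line.split("\t", 1)[0]
        for line in fai_text.splitlines()
        if line.strip()  # skip blank lines
    ]


# Illustrative index content; the offset columns are not real values.
demo_fai = "chr1\t248956422\t112\t70\t71\nchr2\t242193529\t252513167\t70\t71\n"
names = contig_names(demo_fai)
```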
ReferenceIndex
dataclass
¶
Bases: Asset
FASTA index (.fai) file asset.
Carries reference_cid linking back to the Reference it was built from.
Source code in src/stargazer/types/reference.py
Variant call asset types for Stargazer.¶
spec: docs/architecture/types.md
KnownSites
dataclass
¶
Bases: Asset
Known variant sites VCF used for BQSR and as a VQSR training resource.
Standalone asset — carries build and source fields, no container needed.
Source code in src/stargazer/types/variants.py
KnownSitesIndex
dataclass
¶
Bases: Asset
VCF index (.idx) file for a KnownSites asset.
Carries known_sites_cid linking to the KnownSites VCF it indexes. Fetched automatically alongside the VCF via Asset.fetch().
Source code in src/stargazer/types/variants.py
VQSRModel
dataclass
¶
Bases: Asset
VQSR recalibration model (.recal file + tranches path).
Produced by VariantRecalibrator. The recal file is the primary path; the companion tranches file path is stored in keyvalues["tranches_path"].
Source code in src/stargazer/types/variants.py
Variants
dataclass
¶
Utils¶
Local filesystem storage client for Stargazer.¶
Stores files locally with TinyDB metadata indexing. No network access required.
spec: docs/architecture/modes.md
LocalStorageClient
¶
Local filesystem storage client.
Stores files in a local directory and indexes metadata in TinyDB. No network access or API credentials required.
Usage
client = LocalStorageClient()
comp = Asset(path=Path("data.bam"), keyvalues={"type": "alignment"})
await client.upload(comp)
files = await client.query({"type": "alignment"})
await client.download(comp)
Source code in src/stargazer/utils/local_storage.py
db
property
¶
Get TinyDB instance for local metadata storage (lazy initialized).
Re-opens if the DB file has been deleted or modified externally, keeping _last_id in sync when other processes write to the same file.
__init__(local_dir=None)
¶
Initialize local storage client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| local_dir | Optional[Path] | Local directory for file storage (defaults to STARGAZER_LOCAL env var or ~/.stargazer/local) | None |
Source code in src/stargazer/utils/local_storage.py
delete(component)
async
¶
Delete a file from local storage and TinyDB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set | required |
Source code in src/stargazer/utils/local_storage.py
download(component, dest=None)
async
¶
Resolve a local file path and set component.path. For local storage, files are already on disk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set | required |
| dest | Optional[Path] | Optional destination path (copies file there) | None |
Source code in src/stargazer/utils/local_storage.py
query(keyvalues)
async
¶
Query files by keyvalue metadata from TinyDB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| keyvalues | dict[str, str] | Metadata key-value pairs to filter by | required |
Returns:
| Type | Description |
|---|---|
| list[Asset] | List of matching Asset objects |
Source code in src/stargazer/utils/local_storage.py
upload(component)
async
¶
Copy a file to local storage, index metadata in TinyDB, and set component.cid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with path and keyvalues set | required |
Source code in src/stargazer/utils/local_storage.py
Pinata API v3 client for IPFS file storage.¶
Provides an async interface for:
- Uploading files with keyvalue metadata
- Downloading files via IPFS gateway with local caching
- Querying files by keyvalue pairs
- Deleting files
spec: docs/architecture/modes.md
PinataClient
¶
Async client for Pinata API v3.
Used when STARGAZER_MODE=cloud, or when STARGAZER_MODE=local and PINATA_JWT is available. Handles uploads to IPFS via Pinata, downloads via IPFS gateways, and metadata queries against the Pinata API.
Usage
client = PinataClient()
# Upload with metadata
comp = Asset(path=Path("data.bam"), keyvalues={"type": "alignment"})
await client.upload(comp)  # sets comp.cid
# Query by keyvalues
files = await client.query({"type": "alignment", "sample": "NA12878"})
# Download
await client.download(comp)  # sets comp.path
# Delete file
await client.delete(comp)
Source code in src/stargazer/utils/pinata.py
jwt
property
¶
Get JWT token, raising error if not set.
__init__(jwt=None, gateway=None, local_dir=None)
¶
Initialize Pinata client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jwt | Optional[str] | Pinata JWT token (defaults to PINATA_JWT env var) | None |
| gateway | Optional[str] | IPFS gateway URL (defaults to gateway.pinata.cloud) | None |
| local_dir | Optional[Path] | Local directory for download caching | None |
Source code in src/stargazer/utils/pinata.py
delete(component)
async
¶
Delete a file from Pinata by querying for its internal ID first.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set | required |
Source code in src/stargazer/utils/pinata.py
download(component, dest=None)
async
¶
Download a file from IPFS and set component.path. Uses local cache to avoid re-downloading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with cid set | required |
| dest | Optional[Path] | Optional destination path (otherwise uses cache) | None |
Source code in src/stargazer/utils/pinata.py
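The "use the local cache to avoid re-downloading" behaviour can be sketched as a check-then-fetch on a path derived from the CID. Everything here is hypothetical: `cached_download`, the `fetch` callable standing in for the gateway request, and the choice of the CID as the cache file name.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_download(cid: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the cached file for cid, fetching it only on a cache miss."""
    target = cache_dir / cid  # assumption: cache entries are named by CID
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(cid))
    return target


calls = []


def fake_fetch(cid: str) -> bytes:
    """Stand-in for a gateway request; records each call."""
    calls.append(cid)
    return b"demo-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_download("bafy-demo", Path(tmp), fake_fetch)
    second = cached_download("bafy-demo", Path(tmp), fake_fetch)
    same_path = first == second
    cached_bytes = second.read_bytes()
```

Content addressing makes this safe: a given CID always refers to the same bytes, so a cache hit never needs revalidation.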
query(keyvalues)
async
¶
Query files by keyvalue metadata from Pinata API. Paginates through all results automatically.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| keyvalues | dict[str, str] | Metadata key-value pairs to filter by | required |
Returns:
| Type | Description |
|---|---|
| list[Asset] | List of matching Asset objects |
Source code in src/stargazer/utils/pinata.py
upload(component)
async
¶
Upload a file to IPFS via Pinata. Sets component.cid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component | Asset | Asset with path and keyvalues set | required |
Source code in src/stargazer/utils/pinata.py
Query generation utilities for Stargazer.¶
Utilities for generating metadata queries, including support for cartesian product queries across multiple dimensions.
spec: docs/architecture/types.md
generate_query_combinations(base_query, filters)
¶
Generate query combinations from filters using cartesian product.
Takes a base query dict and filters dict, where filters can contain scalar values or lists. For any list-valued filter, generates all combinations using cartesian product, while preserving scalar filters and the base query in all combinations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_query | Dict[str, Any] | Base query dict to include in all combinations | required |
| filters | Dict[str, Any] | Filter dict with scalar or list values | required |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, Any]] | List of query dicts representing all combinations |
Example
base = {"type": "reference"}
filters = {"build": "GRCh38", "tool": ["fasta", "bwa"]}
generate_query_combinations(base, filters)
[
    {"type": "reference", "build": "GRCh38", "tool": "fasta"},
    {"type": "reference", "build": "GRCh38", "tool": "bwa"}
]

base = {"type": "reference"}
filters = {"build": ["GRCh38", "GRCh37"], "tool": ["fasta", "bwa"]}
generate_query_combinations(base, filters)
[
    {"type": "reference", "build": "GRCh38", "tool": "fasta"},
    {"type": "reference", "build": "GRCh38", "tool": "bwa"},
    {"type": "reference", "build": "GRCh37", "tool": "fasta"},
    {"type": "reference", "build": "GRCh37", "tool": "bwa"}
]
Source code in src/stargazer/utils/query.py
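The documented behaviour can be reproduced with `itertools.product`. The function below is an independent sketch (`query_combinations_sketch` is a hypothetical name), not the library's source; it treats only `list` values as multi-valued, which is an assumption.

```python
from itertools import product
from typing import Any, Dict, List


def query_combinations_sketch(
    base_query: Dict[str, Any], filters: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Expand list-valued filters into one query per combination."""
    scalars = {k: v for k, v in filters.items() if not isinstance(v, list)}
    list_keys = [k for k, v in filters.items() if isinstance(v, list)]
    list_values = [filters[k] for k in list_keys]
    # product() with no iterables yields one empty tuple, so purely scalar
    # filters still produce exactly one query.
    return [
        {**base_query, **scalars, **dict(zip(list_keys, combo))}
        for combo in product(*list_values)
    ]


queries = query_combinations_sketch(
    {"type": "reference"},
    {"build": ["GRCh38", "GRCh37"], "tool": ["fasta", "bwa"]},
)
```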
Storage abstraction for Stargazer.¶
Defines the StorageClient protocol, StargazerMode enum, and factory function for creating storage clients based on STARGAZER_MODE configuration.
Configuration
STARGAZER_MODE=local (default) -> local exec, local storage (or Pinata if JWT present)
STARGAZER_MODE=cloud -> union exec, Pinata storage (PINATA_JWT required)
spec: docs/architecture/modes.md
StargazerMode
¶
StorageClient
¶
Bases: Protocol
Protocol defining the storage client interface.
All storage backends (local, Pinata) implement this interface.
Source code in src/stargazer/utils/storage.py
get_client()
¶
Create a storage client based on STARGAZER_MODE and available credentials.
Resolution logic
- STARGAZER_MODE=cloud -> PinataClient (PINATA_JWT required)
- STARGAZER_MODE=local + PINATA_JWT -> PinataClient
- STARGAZER_MODE=local (no JWT) -> LocalStorageClient
Returns:
| Type | Description |
|---|---|
| StorageClient | A StorageClient implementation |
Source code in src/stargazer/utils/storage.py
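The resolution logic above can be sketched as a pure function over an environment mapping. `Mode`, `resolve_mode_sketch`, and `backend_name` are hypothetical names; the sketch returns class names instead of constructing clients, treats mode values as case-sensitive, and does not model how a missing PINATA_JWT in cloud mode is reported.

```python
from enum import Enum
from typing import Mapping


class Mode(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


def resolve_mode_sketch(env: Mapping[str, str]) -> Mode:
    """Read STARGAZER_MODE, defaulting to local; reject unknown values."""
    raw = env.get("STARGAZER_MODE", "local")
    try:
        return Mode(raw)
    except ValueError:
        raise ValueError(f"invalid STARGAZER_MODE: {raw!r}") from None


def backend_name(env: Mapping[str, str]) -> str:
    """Name the storage client that the documented rules would select."""
    mode = resolve_mode_sketch(env)
    if mode is Mode.CLOUD or env.get("PINATA_JWT"):
        return "PinataClient"  # cloud mode additionally requires PINATA_JWT
    return "LocalStorageClient"
```

Passing the environment in as a mapping, rather than reading `os.environ` directly, keeps the rules easy to test.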
resolve_mode()
¶
Resolve the current Stargazer mode from STARGAZER_MODE env var.
Returns:
| Type | Description |
|---|---|
| StargazerMode | StargazerMode.LOCAL or StargazerMode.CLOUD |
Raises:
| Type | Description |
|---|---|
| ValueError | If STARGAZER_MODE is set to an invalid value |
Source code in src/stargazer/utils/storage.py
Workflows¶
GATK Best Practices: Data Pre-processing for Variant Discovery¶
Implements:
1. Reference preparation: FASTA index, sequence dictionary, BWA index
2. Sample preprocessing: align, sort, mark duplicates, BQSR
spec: docs/architecture/workflows.md
prepare_reference(build)
async
¶
Prepare reference genome for alignment and variant calling.
Assembles the reference FASTA from storage and creates necessary indices:
1. FASTA index (samtools faidx)
2. BWA index (bwa index)
All indices are uploaded to storage as side-effects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| build | str | Reference genome build identifier (e.g. "GRCh38") | required |
Returns:
| Type | Description |
|---|---|
| Reference | Reference asset (FASTA file) |
Source code in src/stargazer/workflows/gatk_data_preprocessing.py
preprocess_sample(build, sample_id, run_bqsr=True)
async
¶
Pre-process a single sample's reads for variant calling.
Assembles reference and reads from storage, then runs:
1. BWA-MEM alignment
2. Coordinate sort (GATK SortSam)
3. Mark duplicates (GATK MarkDuplicates)
4. BQSR (optional, GATK BaseRecalibrator + ApplyBQSR)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| build | str | Reference genome build identifier | required |
| sample_id | str | Sample identifier used to query reads and known sites | required |
| run_bqsr | bool | Whether to apply BQSR (default: True) | True |
Returns:
| Type | Description |
|---|---|
| Alignment | Alignment asset with the preprocessed BAM file |
Source code in src/stargazer/workflows/gatk_data_preprocessing.py
GATK Best Practices: Germline Short Variant Discovery (SNPs + Indels)¶
Implements the full GATK pipeline from preprocessed BAMs:
- HaplotypeCaller — per-sample GVCF (parallel)
- joint_call_gvcfs — GenomicsDBImport + GenotypeGVCFs in one task
- VariantRecalibrator (INDEL) — build VQSR model
- ApplyVQSR INDEL — filter indels → final VCF
Prerequisites
Reference and sample alignments must already be in storage (run prepare_reference and preprocess_sample first). VQSR training resources (HapMap, omni, 1000G, mills, dbSNP) must be stored with build, resource_name, known, training, truth, and prior keyvalues, tagged with vqsr_mode=INDEL.
spec: docs/architecture/workflows.md
germline_short_variant_discovery(build, cohort_id='cohort')
async
¶
Germline short variant discovery from preprocessed BAMs.
Assembles all BQSR-applied alignments and reference from storage, then runs the full GATK Best Practices pipeline through VQSR filtering.
Expects preprocess_sample to have been run first — alignments must have bqsr_applied=true.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| build | str | Reference genome build identifier (e.g. "GRCh38") | required |
| cohort_id | str | Identifier for the cohort output (default: "cohort") | 'cohort' |
Returns:
| Type | Description |
|---|---|
| Variants | VQSR-filtered joint-genotyped Variants asset |