get_genome_fasta {ORFik} | R Documentation |
This function automatically downloads (if the files do not already exist)
the genomes and contaminants specified for genome alignment.
It will create an R transcript database (TxDb object) from the annotation.
It will also index the genome for you.
If you misspelled something or the run crashed, delete the wrong files and
run again.
Set remake = TRUE to do it all over again.
get_genome_fasta(genome, output.dir, organism, assembly_type, db, gunzip)
genome |
logical, default: TRUE, download the genome of the organism
specified in the "organism" argument. If FALSE, check if the downloaded
file already exists. If you want to use a custom gtf from your hard drive,
set GTF = FALSE,
and assign: |
output.dir |
directory to save downloaded data |
organism |
scientific name of organism, Homo sapiens,
Danio rerio, Mus musculus, etc. See |
assembly_type |
a character string specifying which assembly type
the genome should be retrieved from (ensembl only, otherwise this argument is ignored):
Default is
|
db |
database to use for genome and GTF. Recommended default: "ensembl" (remember to set assembly_type to "primary_assembly", otherwise the download will contain haplotypes and be a very large file). Alternatives: "refseq" (primary assembly) and "genbank" (mix) |
gunzip |
logical, default TRUE. Uncompress downloaded files that arrive compressed; this should be TRUE. |
If you want to use a custom genome or gtf from your hard drive, assign it
after you run this function, like this:
annotation <- getGenomeAndAnnotation(GTF = FALSE, genome = FALSE)
annotation["genome"] = "path/to/genome.fasta"
annotation["gtf"] = "path/to/gtf.gtf"
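As a minimal sketch of the assignment above, the custom paths can be checked before they are stored. The helper `set_custom_paths` is hypothetical (it is not an ORFik function); it only assumes that the returned object is a named character vector with "genome" and "gtf" entries, as described on this page.

```r
# Hypothetical helper (not part of ORFik): store custom paths in the
# named character vector, failing early if a file is missing.
set_custom_paths <- function(annotation, genome, gtf) {
  custom <- c(genome = genome, gtf = gtf)
  missing_files <- custom[!file.exists(custom)]
  if (length(missing_files) > 0) {
    stop("File(s) not found: ", paste(missing_files, collapse = ", "))
  }
  annotation[names(custom)] <- custom
  annotation
}

# Demonstration with temporary files standing in for real references
fa_file <- tempfile(fileext = ".fasta")
gtf_file <- tempfile(fileext = ".gtf")
invisible(file.create(fa_file, gtf_file))
annotation <- c(genome = "", gtf = "")  # stand-in for the function's return value
annotation <- set_custom_paths(annotation, genome = fa_file, gtf = gtf_file)
```

Failing early on a missing file avoids passing a mistyped path on to later alignment or TxDb steps.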
a named character vector of paths to the downloaded genome and gtf, plus any additional contaminants if used. If merge_contaminants is TRUE, the individual contaminant fasta files are not returned, only the merged one.
Other STAR: STAR.align.folder(), STAR.align.single(), STAR.allsteps.multiQC(), STAR.index(), STAR.install(), STAR.multiQC(), STAR.remove.crashed.genome(), install.fastp()
## Get Saccharomyces cerevisiae genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")
## Get Danio rerio genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Danio rerio", tempdir())

output.dir <- "/Bio_data/references/zebrafish"
## Get Danio rerio and Phix contaminants to deplete during alignment
#getGenomeAndAnnotation("Danio rerio", output.dir, phix = TRUE)

## Optimize for ORFik (speed up for large annotations like human or zebrafish)
#getGenomeAndAnnotation("Danio rerio", tempdir(), optimize = TRUE)

## How to save malformed refseq gffs:
## First run function and let it crash:
#annotation <- getGenomeAndAnnotation(organism = "Arabidopsis thaliana", output.dir = "~/Desktop/test_plant/",
#  assembly_type = "primary_assembly", db = "refseq")
## Then apply a fix (example for linux, too long rows):
# system("cat ~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.gff | awk '{ if (length($0) < 32768) print }' > ~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq_trimmed.gff")
## Then updated arguments:
annotation <- c("~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq_trimmed.gff",
                "~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.fna")
names(annotation) <- c("gtf", "genome")
# Make the txdb (for faster R use)
# makeTxdbFromGenome(annotation["gtf"], annotation["genome"], organism = "Arabidopsis thaliana")
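
The awk step in the example above is Linux-specific. As a rough, platform-independent sketch, the same filter (keep only rows shorter than 32768 characters) can be written in base R. The helper name `trim_long_gff_rows` and its arguments are hypothetical and not part of ORFik. The sketch reads the whole file into memory, which may be unsuitable for very large annotations.

```r
# Hypothetical helper (not part of ORFik): drop over-long rows from a GFF file.
# Mirrors the awk filter: keep a row only if it is shorter than max_chars.
trim_long_gff_rows <- function(input, output, max_chars = 32768) {
  rows <- readLines(input, warn = FALSE)
  # Counting bytes approximates awk's length() in a byte-oriented locale
  keep <- nchar(rows, type = "bytes") < max_chars
  writeLines(rows[keep], output)
  # Return the number of dropped rows invisibly, for a quick sanity check
  invisible(sum(!keep))
}

# Small self-contained demonstration on a temporary file
demo_in <- tempfile(fileext = ".gff")
demo_out <- tempfile(fileext = ".gff")
writeLines(c("short row", strrep("A", 40000), "another short row"), demo_in)
dropped <- trim_long_gff_rows(demo_in, demo_out)
```

The returned count shows how many rows were removed, which is worth checking before building a TxDb from the trimmed file.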