文章目录
- 简介
- 安装
- ```datasets download```下载基因组/基因序列
- 按照GCA list文件编号下载
- 下载大基因组
- genome完整参数
- gene参数
- ```datasets summary```下载元数据
- ```dataformat```将json转换成表格格式
- 通过json文件解析其他字段
- 问题
简介
NCBI Datasets 可以轻松从 NCBI 数据库中收集数据。使用命令行界面(CLI)工具或 NCBI Datasets 网页界面查找和下载基因和基因组的序列、注释和元数据。如下是可用的工具:
安装
- 使用conda安装Datasets CLI tools,
datasets
and dataformat
:
# 注意不是datasets而是ncbi-datasets-cli
$ conda install -c conda-forge ncbi-datasets-cli
(base) [yut@io02 ~]$ datasets --version
datasets version: 15.25.0
datasets download
下载基因组/基因序列
datasets
从 NCBI 下载所有生命领域的生物序列数据,dataformat
将前者下载的数据包中的元数据从 JSON Lines 格式转换为其他格式。
使用datasets
下载人类参考基因组 GRCh38 的基因组数据包:
$ datasets download genome taxon human --reference --filename human-reference.zip
使用 dataformat
从下载的人类参考基因组 GRCh38 数据包中提取选定的元数据字段:
$ dataformat tsv genome --package human-reference.zip --fields organism-name,assminfo-name,accession,assminfo-submitter
Organism name Assembly Name Assembly Accession Assembly Submitter
Homo sapiens GRCh38.p14 GCF_000001405.40 Genome Reference Consortium
按照GCA list文件编号下载
(base) [yut@io02 02_Glacier_new_taxa]$ head 3.gca
GCF_020042285.1
GCF_020783315.1
GCF_024343615.1
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile 3.gca --include gff3,rna,cds,protein,genome,seq-report --filename 3genome.zip
# --inputfile:输入GCA号的list,每行一个
# --filename:输出zip包名称,默认ncbi-dataset.zipNew version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 3 genome records [================================================] 100% 3/3
Downloading: 3genome.zip 10.4MB valid zip archive
Validating package files [================================================] 100% 18/18real 0m8.208s
user 0m0.652s
sys 0m0.234s
(base) [yut@io02 02_Glacier_new_taxa]$ ls
3genome.zip download.log
下载大基因组
下载大量基因组,首先下载压缩包,然后分三步访问数据。
- 1.下载人基因组压缩包
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
- 2.解压
unzip human_GRCh38_dataset.zip -d my_human_dataset
- 3.转换格式
datasets rehydrate --directory my_human_dataset/
genome完整参数
(base) [yut@io02 ~]$ datasets download genome --helpDownload a genome data package. Genome data packages may include genome, transcript and protein sequences, annotation and one or more data reports. Data packages are downloaded as a zip archive.The default genome data package includes the following files:* <accession>_<assembly_name>_genomic.fna (genomic sequences)* assembly_data_report.jsonl (data report with genome assembly and annotation metadata)* dataset_catalog.json (a list of files and file types included in the data package)Usagedatasets download genome [flags]datasets download genome [command]Sample Commandsdatasets download genome accession GCF_000001405.40 --chromosomes X,Y --include genome,gff3,rnadatasets download genome taxon "bos taurus" --dehydrateddatasets download genome taxon human --assembly-level chromosome,complete --dehydrateddatasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydratedAvailable Commandsaccession Download a genome data package by Assembly or BioProject accessiontaxon Download a genome data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)Flags--annotated Limit to annotated genomes--assembly-level string Limit to genomes at one or more assembly levels (comma-separated):* chromosome* complete* contig* scaffold(default "[]")--assembly-source string Limit to 'RefSeq' (GCF_) or 'GenBank' (GCA_) genomes (default "all")--chromosomes strings Limit to a specified, comma-delimited list of chromosomes, or 'all' for all chromosomes--dehydrated Download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).--exclude-atypical Exclude atypical assemblies--mag string Limit to metagenome assembled genomes (only) or remove them from the results (exclude) (default "all")--preview Show information about the requested data package--reference Limit to reference genomes--released-after string Limit to genomes released on or after a specified date (MM/DD/YYYY)--released-before string Limit to genomes released on or before a specified date (MM/DD/YYYY)--search strings Limit results to genomes with specified text in the searchable fields:species and infraspecies, assembly name and submitter.To search multiple strings, use the flag multiple times.Global Flags--api-key string Specify an NCBI API key--debug Emit debugging info--filename string Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")--help Print detailed help about a datasets command--no-progressbar Hide progress bar--version Print version of datasetsUse datasets download genome <command> --help for detailed help about a command.
gene参数
(base) [yut@io02 ~]$ datasets download gene --helpDownload a gene data package. Gene data packages include gene, transcript and protein sequences and one or more data reports. Data packages are downloaded as a zip archive.The default gene data package for NM, NR, NP, XM, XR, XP and YP accessions:* rna.fna (transcript sequences)* protein.faa (protein sequences)* data_report.jsonl (data report with gene metadata)* dataset_catalog.json (a list of files and file types included in the data package)Usagedatasets download gene [flags]datasets download gene [command]Sample Commandsdatasets download gene gene-id 672datasets download gene symbol brca1 --taxon mousedatasets download gene accession NP_000483.3datasets download gene gene-id 2778 --fasta-filter NC_000020.11,NM_001077490.3,NP_001070958.1Available Commandsgene-id Download a gene data package by NCBI Gene IDsymbol Download a gene data package by gene symbolaccession Download a gene data package by RefSeq nucleotide or protein accessiontaxon Download a gene data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)Flags--fasta-filter strings Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions--fasta-filter-file string Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions included in the specified file--preview Show information about the requested data packageGlobal Flags--api-key string Specify an NCBI API key--debug Emit debugging info--filename string Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")--help Print detailed help about a datasets command--no-progressbar Hide progress bar--version Print version of datasetsUse datasets download gene <command> --help for detailed help about a command.
datasets summary
下载元数据
(base) [yut@io02 ~]$ datasets summary --helpPrint a data report containing gene, genome or virus metadata in JSON format.Usagedatasets summary [flags]datasets summary [command]Sample Commandsdatasets summary genome accession GCF_000001405.40datasets summary genome taxon "mus musculus"datasets summary gene gene-id 672datasets summary gene symbol brca1 --taxon mousedatasets summary gene accession NP_000483.3datasets summary virus genome accession NC_045512.2datasets summary virus genome taxon sars-cov-2 --host dogAvailable Commandsgene Print a summary of a gene datasetgenome Print a data report containing genome metadatavirus Print a data report containing virus genome metadataGlobal Flags--api-key string Specify an NCBI API key--debug Emit debugging info--help Print detailed help about a datasets command--version Print version of datasetsUse datasets summary <command> --help for detailed help about a command.
- 实例
(base) [yut@io02 ~]$ datasets summary genome accession GCF_000001405.40
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"accession":"GCF_000001405.40","annotation_info":{"busco":{"busco_lineage":"primates_odb10","busco_ver":"4.1.4","complete":0.99187225,"duplicated":0.007256894,"fragmented":0.0015239477,"missing":0.0066037737,"single_copy":0.9846154,"total_count":"13780"},"method":"Best-placed RefSeq; Gnomon; RefSeqFE; cmsearch; tRNAscan-SE","name":"GCF_000001405.40-RS_2023_10","pipeline":"NCBI eukaryotic genome annotation pipeline","provider":"NCBI RefSeq","release_date":"2023-10-02","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2023_10.html","software_version":"10.2","stats":{"gene_counts":{"non_coding":22158,"other":413,"protein_coding":20080,"pseudogene":17001,"total":59652}},"status":"Updated annotation"},"assembly_info":{"assembly_level":"Chromosome","assembly_name":"GRCh38.p14","assembly_status":"current","assembly_type":"haploid-with-alt-loci","bioproject_accession":"PRJNA31257","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA31257","title":"The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"}]}],"blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCF_000001405.40","description":"Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)","paired_assembly":{"accession":"GCA_000001405.29","only_genbank":"4 unlocalized and unplaced scaffolds.","status":"current"},"refseq_category":"reference genome","release_date":"2022-02-03","submitter":"Genome Reference Consortium","synonym":"hg38"},"assembly_stats":{"contig_l50":18,"contig_n50":57879411,"gaps_between_scaffolds_count":349,"gc_count":"1374283647","gc_percent":41,"number_of_component_sequences":35611,"number_of_contigs":996,"number_of_organelles":1,"number_of_scaffolds":470,"scaffold_l50":16,"scaffold_n50":67794873,"total_number_of_chromosomes":24,"total_sequence_length":"3099441038","total_ungapped_length":"2948318359"},"current_accession":"GCF_000001405.40","organelle_info":[{"description":"Mitochondrion","submitter":"Genome Reference Consortium","total_seq_length":"16569"}],"organism":{"common_name":"human","organism_name":"Homo sapiens","tax_id":9606},"paired_accession":"GCA_000001405.29","source_database":"SOURCE_DATABASE_REFSEQ"}],"total_count": 1}(base) [yut@io02 ~]$ datasets summary gene gene-id 672
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"gene":{"annotations":[{"annotation_name":"GCF_000001405.40-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_000001405.40","assembly_name":"GRCh38.p14","genomic_locations":[{"genomic_accession_version":"NC_000017.11","genomic_range":{"begin":"43044295","end":"43170327","orientation":"minus"},"sequence_name":"17"}]},{"annotation_name":"GCF_009914755.1-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_009914755.1","assembly_name":"T2T-CHM13v2.0","genomic_locations":[{"genomic_accession_version":"NC_060941.1","genomic_range":{"begin":"43902857","end":"44029084","orientation":"minus"},"sequence_name":"17"}]}],"chromosomes":["17"],"common_name":"human","description":"BRCA1 DNA repair associated","ensembl_gene_ids":["ENSG00000012048"],"gene_groups":[{"id":"672","method":"NCBI Ortholog"}],"gene_id":"672","nomenclature_authority":{"authority":"HGNC","identifier":"HGNC:1100"},"omim_ids":["113705"],"orientation":"minus","protein_count":368,"reference_standards":[{"gene_range":{"accession_version":"NG_005905.2","range":[{"begin":"92501","end":"173689","orientation":"plus"}]},"type":"REFSEQ_GENE"}],"swiss_prot_accessions":["P38398"],"symbol":"BRCA1","synonyms":["IRIS","PSCP","BRCAI","BRCC1","FANCS","PNCA4","RNF53","BROVCA1","PPP1R53"],"tax_id":"9606","taxname":"Homo sapiens","transcript_count":368,"transcript_type_counts":[{"count":368,"type":"PROTEIN_CODING"}],"type":"PROTEIN_CODING"},"query":["672"]}],"total_count": 1}
- 下载结果为json格式
dataformat
将json转换成表格格式
(base) [yut@io02 ~]$ dataformat tsvConvert data to TSV format.Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.Usagedataformat tsv [command]Report Commandsgenome Convert Genome Assembly Data Report into TSV formatgenome-seq Convert Genome Assembly Sequence Report into TSV formatgene Convert Gene Report into TSV formatgene-product Convert Gene Product Report into TSV formatvirus-genome Convert Virus Data Report into TSV formatvirus-annotation Convert Virus Annotation Report into TSV formatmicrobigge Convert MicroBIGG-E Data Report into TSV formatprok-gene Convert Prokaryote Gene Report into TSV formatprok-gene-location Convert Prokaryote Gene Location Report into TSV formatgenome-annotations Convert Genome Annotation Report into TSV formatFlags--elide-header Do not output header-h, --help help for tsvGlobal Flags--force Force dataformat to run without type check promptUse dataformat tsv <command> --help for detailed help about a command.(base) [yut@io02 ~]$ dataformat tsv gene
Error: --inputfile and/or --packagefile must be specified, or data can be read from standard input
Usagedataformat tsv gene [flags]Examplesdataformat tsv gene --inputfile gene_package/ncbi_dataset/data/data_report.jsonldataformat tsv gene --package genes.zipFlags--fields strings Comma-separated list of fields (default annotation-assembly-accession,annotation-assembly-name,annotation-genomic-range-accession,annotation-genomic-range-exon-order,annotation-genomic-range-exon-orientation,annotation-genomic-range-exon-start,annotation-genomic-range-exon-stop,annotation-genomic-range-range-order,annotation-genomic-range-range-orientation,annotation-genomic-range-range-start,annotation-genomic-range-range-stop,annotation-genomic-range-seq-name,annotation-release-date,annotation-release-name,chromosomes,common-name,description,ensembl-geneids,gene-id,gene-type,genomic-region-gene-range-accession,genomic-region-gene-range-range-order,genomic-region-gene-range-range-orientation,genomic-region-gene-range-range-start,genomic-region-gene-range-range-stop,genomic-region-genomic-region-type,group-id,group-method,name-authority,name-id,omim-ids,orientation,protein-count,ref-standard-gene-range-accession,ref-standard-gene-range-range-order,ref-standard-gene-range-range-orientation,ref-standard-gene-range-range-start,ref-standard-gene-range-range-stop,ref-standard-genomic-region-type,replaced-gene-id,rna-type,swissprot-accessions,symbol,synonyms,tax-id,tax-name,transcript-count)- annotation-assembly-accession- annotation-assembly-name- annotation-genomic-range-accession- annotation-genomic-range-exon-order- annotation-genomic-range-exon-orientation- annotation-genomic-range-exon-start- annotation-genomic-range-exon-stop- annotation-genomic-range-range-order- annotation-genomic-range-range-orientation- annotation-genomic-range-range-start- annotation-genomic-range-range-stop- annotation-genomic-range-seq-name- annotation-release-date- annotation-release-name- chromosomes- common-name- description- ensembl-geneids- gene-id- gene-type- genomic-region-gene-range-accession- genomic-region-gene-range-range-order- genomic-region-gene-range-range-orientation- genomic-region-gene-range-range-start- genomic-region-gene-range-range-stop- genomic-region-genomic-region-type- group-id- group-method- name-authority- name-id- omim-ids- orientation- protein-count- ref-standard-gene-range-accession- ref-standard-gene-range-range-order- ref-standard-gene-range-range-orientation- ref-standard-gene-range-range-start- ref-standard-gene-range-range-stop- ref-standard-genomic-region-type- replaced-gene-id- rna-type- swissprot-accessions- symbol- synonyms- tax-id- tax-name- transcript-count-h, --help help for gene--inputfile string Input file (default "ncbi_dataset/data/data_report.jsonl")--package string Data package (zip archive), inputfile parameter is relative to the root path inside the archiveGlobal Flags--elide-header Do not output header--force Force dataformat to run without type check prompt
- 实例
(base) [yut@io02 ~]$ datasets summary gene gene-id 672 --as-json-lines |dataformat tsv gene
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Annotation Assembly Accession Annotation Assembly Name Annotation Genomic Range Accession Annotation Genomic Range Exons Order Annotation Genomic Range Exons Orientation Annotation Genomic Range Exons Start Annotation Genomic Range Exons Stop Annotation Genomic Range Order Annotation Genomic Range Orientation Annotation Genomic Range Start Annotation Genomic Range Stop Annotation Genomic Range Seq Name Annotation Release Date Annotation Release Name Chromosomes Common Name Description Ensembl GeneIDs NCBI GeneID Gene Type Genomic Region Gene Range Sequence Accession Genomic Region Gene Range Order Genomic Region Gene Range Orientation Genomic Region Gene Range Start Genomic Region Gene Range Stop Genomic Region Genomic Region Type Gene Group Identifier Gene Group Method Nomenclature Authority Nomenclature ID OMIM IDs Orientation Proteins Reference Standard Gene Range Sequence Accession Reference Standard Gene Range Order Reference Standard Gene Range Orientation Reference Standard Gene Range Start Reference Standard Gene Range Stop Reference Standard Genomic Region Type Replaced NCBI GeneID RNA Type SwissProt Accessions Symbol Synonyms Taxonomic ID Taxonomic Name Transcripts
GCF_000001405.40 GRCh38.p14 NC_000017.11 minus 43044295 43170327 17 2023-10-02 GCF_000001405.40-RS_2023_10 17 human BRCA1 DNA repair associated ENSG00000012048 672 PROTEIN_CODING 672 NCBI Ortholog HGNC HGNC:1100 113705 minus 368 NG_005905.2 plus 92501 173689 REFSEQ_GENE P38398 BRCA1 IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606 Homo sapiens 368
GCF_009914755.1 T2T-CHM13v2.0 NC_060941.1 minus 43902857 44029084 17 2023-10-02 GCF_009914755.1-RS_2023_10 17 human BRCA1 DNA repair associated ENSG00000012048 672 PROTEIN_CODING 672 NCBI Ortholog HGNC HGNC:1100 113705 minus 368 NG_005905.2 plus 92501 173689 REFSEQ_GENE P38398 BRCA1 IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606 Homo sapiens 368(base) [yut@io02 ~]$ datasets summary gene gene-id 672 --as-json-lines |dataformat tsv gene --fields gene-id,gene-type,symbol
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
NCBI GeneID Gene Type Symbol
672 PROTEIN_CODING BRCA1# --as-json-lines必须加上
# --fields指定需要的字段,多个空格隔开
通过json文件解析其他字段
- 某些字段无法通过
dataformat
提取出来,可先保存成json文件,然后通过下面脚本解析:
(base) [yut@node01 ~]$ cat dataset.json
{"accession":"GCA_013141435.1","annotation_info":{"method":"Best-placed reference protein set; GeneMarkS-2+","name":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","pipeline":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","provider":"NCBI","release_date":"2020-05-14","software_version":"4.11","stats":{"gene_counts":{"non_coding":27,"protein_coding":2566,"pseudogene":17,"total":2610}}},"assembly_info":{"assembly_level":"Contig","assembly_method":"MetaSPAdes v. 3.10.1","assembly_name":"ASM1314143v1","assembly_status":"current","assembly_type":"haploid","bioproject_accession":"PRJNA622654","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA622654","title":"Metagenomic profiling of ammonia and methane-oxidizing microorganisms in a Dutch drinking water treatment plant"}]}],"biosample":{"accession":"SAMN14539096","attributes":[{"name":"isolation_source","value":"Primary rapid sand filter"},{"name":"collection_date","value":"not applicable"},{"name":"geo_loc_name","value":"Netherlands"},{"name":"lat_lon","value":"not applicable"},{"name":"isolate","value":"P-RSF-IL-07"},{"name":"depth","value":"not applicable"},{"name":"env_broad_scale","value":"drinking water treatment plant"},{"name":"env_local_scale","value":"Primary rapid sand filter"},{"name":"env_medium","value":"not applicable"},{"name":"metagenomic","value":"1"},{"name":"environmental-sample","value":"1"},{"name":"sample_type","value":"metagenomic assembly"},{"name":"metagenome-source","value":"drinking water metagenome"},{"name":"derived_from","value":"This BioSample is a metagenomic assembly obtained from the drinking water metagenome BioSample:SAMN14524263, SAMN14524264, SAMN14524265, SAMN14524266"}],"bioprojects":[{"accession":"PRJNA622654"}],"description":{"comment":"Keywords: GSC:MIxS;MIMAG:6.0","organism":{"organism_name":"Ferruginibacter sp.","tax_id":1940288},"title":"MIMAG Metagenome-assembled Genome sample from Ferruginibacter sp."},"last_updated":"2020-05-19T00:50:12.857","models":["MIMAG.water"],"owner":{"contacts":[{}],"name":"Radboud University"},"package":"MIMAG.water.6.0","publication_date":"2020-05-19T00:50:12.857","sample_ids":[{"label":"Sample name","value":"Ferruginibacter sp. P-RSF-IL-07"}],"status":{"status":"live","when":"2020-05-19T00:50:12.857"},"submission_date":"2020-04-04T13:17:04.950"},"comments":"The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/","genome_notes":["derived from metagenome"],"release_date":"2020-05-21","sequencing_tech":"Illumina MiSeq","submitter":"Radboud University"},"assembly_stats":{"contig_l50":10,"contig_n50":104094,"gc_count":"978119","gc_percent":32,"genome_coverage":"270.4x","number_of_component_sequences":43,"number_of_contigs":43,"total_sequence_length":"3056910","total_ungapped_length":"3056910"},"average_nucleotide_identity":{"best_ani_match":{"ani":79.65,"assembly":"GCA_003426875.1","assembly_coverage":0.01,"category":"type","organism_name":"Lutibacter oceani","type_assembly_coverage":0.01},"category":"category_na","comment":"na","match_status":"low_coverage","submitted_organism":"Ferruginibacter sp.","submitted_species":"Ferruginibacter sp.","taxonomy_check_status":"Inconclusive"},"current_accession":"GCA_013141435.1","organism":{"infraspecific_names":{"isolate":"P-RSF-IL-07"},"organism_name":"Ferruginibacter sp.","tax_id":1940288},"source_database":"SOURCE_DATABASE_GENBANK","wgs_info":{"master_wgs_url":"https://www.ncbi.nlm.nih.gov/nuccore/JABFQZ000000000.1","wgs_contigs_url":"https://www.ncbi.nlm.nih.gov/Traces/wgs/JABFQZ01","wgs_project_accession":"JABFQZ01"}}(base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py *json
Save result in output.csv
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly
(base) [yut@node01 ~]$ cat ~/Software/Important_scripts/Parse_dataset_genome_json_metadata.py
#!/usr/bin/env python
import argparse
import json
import pandas as pd# 创建参数解析器
parser = argparse.ArgumentParser(description='Parse JSON data')
parser.add_argument('json_file', help='Path to the JSON file')# 解析参数
args = parser.parse_args()# 读取JSON文件
with open(args.json_file, 'r') as file:json_str = file.read()# 解析JSON
data = json.loads(json_str)# 获取env_broad_scale字段的值
# 获取所需字段的值
accession = data["accession"]
geo_loc_name = data["assembly_info"]["biosample"]["attributes"][2]["value"]
lat_lon = data["assembly_info"]["biosample"]["attributes"][3]["value"]
collection_date = data['assembly_info']['biosample']['attributes'][1]['value']
env_broad_scale = data["assembly_info"]["biosample"]["attributes"][6]["value"]
env_local_scale = data['assembly_info']['biosample']['attributes'][7]['value']
env_medium = data['assembly_info']['biosample']['attributes'][8]['value']
sample_type = data['assembly_info']['biosample']['attributes'][11]['value']# output
# 创建DataFrame
df = pd.DataFrame({'Accession': [accession],'Geo Location Name': [geo_loc_name],'Latitude and Longitude': [lat_lon],'Collection date' : [collection_date],'Env broad scale' : [env_broad_scale],'Env local scale' : [env_local_scale],'Env medium' : [env_medium],'Sample type' : [sample_type]
})# 将DataFrame保存为CSV文件
df.to_csv('output.csv', index=False)
print('Save result in output.csv ')# run
$ (base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py dataset.json
Save result in output.csv
# output
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly
问题
- Error: Internal error (invalid zip archive)并且没有输出文件
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile GTDB_R214_63_Ferruginibacter_genus.GCA --include gff3,rna,cds,protein,genome,seq-report
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 63 genome records [================================================] 100% 63/63
Downloading: ncbi_dataset.zip 146MB done
Validating package files [===========>------------------------------------] 28% 70/252
Error: Internal error (invalid zip archive). Please try again
上述问题可能是输入编号既包括GCA又包括GCF编号,解决办法是将两者分开下载,或者等到Validating package files停掉命令