R package:blupADC-Feature 1

Overview

geno_format is the basic function of the blupADC package. With geno_format, we can easily convert between multiple genotype data formats, including Hapmap, Plink, BLUPF90, Numeric, Haplotype and VCF.

Example

Format conversion based on provided R variable

library(blupADC)
format_result <- geno_format(
  input_data_hmp = example_data_hmp,                          # provided Hapmap data object
  output_data_type = c("Plink", "BLUPF90", "Numeric", "VCF"), # output data formats
  return_result = TRUE,                                       # return the result
  cpu_cores = 1                                               # number of CPU cores
)

Format conversion based on provided data path and data name

# Convert phased VCF data to the haplotype format and the haplotype-based numeric format
library(blupADC)
data_path <- system.file("extdata", package = "blupADC")  # path of the example files
phased_result <- geno_format(
  input_data_path = data_path,                  # input data path
  input_data_name = "example.vcf",              # input data name (VCF data)
  input_data_type = "VCF",                      # input data type
  phased_genotype = TRUE,                       # whether the VCF data has been phased
  haplotype_window_nSNP = 5,                    # define haplotype blocks by number of SNPs
  output_data_type = c("Haplotype", "Numeric"), # output data formats
  return_result = TRUE,                         # save the result as an R environment variable
  cpu_cores = 1                                 # number of CPU cores
)

Format conversion via bigmemory method

library(blupADC)
data_path <- system.file("extdata", package = "blupADC")  # path of the example files
phased <- geno_format(
  input_data_path = data_path,                  # input data path
  input_data_name = "example.vcf",              # input data name (VCF data)
  input_data_type = "VCF",                      # input data type
  phased_genotype = TRUE,                       # whether the VCF data has been phased
  haplotype_window_nSNP = 5,                    # define haplotype blocks by number of SNPs
  bigmemory_cal = TRUE,                         # format conversion via a bigmemory object
  bigmemory_data_path = getwd(),                # path of the bigmemory data
  bigmemory_data_name = "test_blupADC",         # name of the bigmemory data
  output_data_type = c("Haplotype", "Numeric"), # output data formats
  return_result = TRUE,                         # save the result in the R environment
  cpu_cores = 1                                 # number of CPU cores
)

Output

The output contains several parts, depending on the requested output formats:

  • hmp : Hapmap format genotype data

The first column is the SNP name, the third column is the chromosome, the fourth column is the physical position, and the twelfth and subsequent columns hold the genotype data.

rs#   alleles  chrom  pos     strand  assembly  center  protLSID  assayLSID  panelLSID  QCcode  3098  3498  3297  2452
SNP1  NA       1      224488  NA      NA        NA      NA        NA         NA         NA      CC    AC    AC    CC
SNP2  NA       1      293696  NA      NA        NA      NA        NA         NA         NA      GG    TG    TG    GG
SNP3  NA       1      333333  NA      NA        NA      NA        NA         NA         NA      GG    TT    TT    GG
SNP4  NA       1      464830  NA      NA        NA      NA        NA         NA         NA      CC    CC    CC    CC
SNP5  NA       1      722623  NA      NA        NA      NA        NA         NA         NA      AA    GG    GG    AA
SNP6  NA       1      838596  NA      NA        NA      NA        NA         NA         NA      CC    TC    TT    CC
  • ped : Plink format ped data

The first column is the family name, the second column is the individual name, and the seventh and subsequent columns hold the genotype data.

3098  3098  0  0  0  0  C  C  G  G
3498  3498  0  0  0  0  A  C  T  G
3297  3297  0  0  0  0  A  C  T  G
2452  2452  0  0  0  0  C  C  G  G
4255  4255  0  0  0  0  A  C  G  G
2946  2946  0  0  0  0  C  C  T  G
294629460000CCTG
  • map : Plink format map data

The first column is the chromosome, the second column is the SNP name, the third column is the genetic position (cM), and the fourth column is the physical position.

1  SNP1  0.224488  224488
1  SNP2  0.293696  293696
1  SNP3  0.333333  333333
1  SNP4  0.464830  464830
1  SNP5  0.722623  722623
1  SNP6  0.838596  838596
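
In the example above, each genetic position equals the physical position divided by 1e6. The following is a minimal sketch of building such a map table in base R under that assumption; whether blupADC always derives the genetic position this way is not stated here, and the object and column names (`hmp_info`, `plink_map`, `cM`) are made up for illustration.

```r
# SNP information copied from the Hapmap example above.
hmp_info <- data.frame(
  snp   = paste0("SNP", 1:6),
  chrom = 1,
  pos   = c(224488, 293696, 333333, 464830, 722623, 838596),
  stringsAsFactors = FALSE
)

# Assumption: genetic position (cM) = physical position / 1e6,
# which reproduces the values shown in the map example.
plink_map <- data.frame(
  chrom = hmp_info$chrom,
  snp   = hmp_info$snp,
  cM    = hmp_info$pos / 1e6,
  pos   = hmp_info$pos,
  stringsAsFactors = FALSE
)

print(plink_map)
```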
  • blupf90 : BLUPF90 format genotype data

The first column is the individual name, and the second column is the genotype data (numeric).

3098  200000
3498  112021
3297  112022
2452  200000
4255  102011
2946  212000
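
The two-column layout above can be thought of as the numeric genotypes of each individual pasted into one string. The sketch below reproduces that layout in base R from the first four example individuals; it only illustrates the layout and is not the package's own conversion code, and the names `numeric_geno` and `blupf90_like` are hypothetical.

```r
# Numeric genotypes of four example individuals (rows) at six SNPs (columns).
numeric_geno <- matrix(
  c(2, 0, 0, 0, 0, 0,
    1, 1, 2, 0, 2, 1,
    1, 1, 2, 0, 2, 2,
    2, 0, 0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("3098", "3498", "3297", "2452"), paste0("SNP", 1:6))
)

# Collapse each row into a single genotype string, one row per individual.
blupf90_like <- data.frame(
  id       = rownames(numeric_geno),
  genotype = unname(apply(numeric_geno, 1, paste0, collapse = "")),
  stringsAsFactors = FALSE
)

print(blupf90_like)
```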
  • numeric : Numeric format genotype data

The row names of the numeric data are the individual names, the column names are the SNP names, and 0, 1, 2 are the numeric genotype codes.

2  0  0  0  0  0
1  1  2  0  2  1
1  1  2  0  2  2
2  0  0  0  0  0
1  0  2  0  1  1
2  1  2  0  0  0
  • haplotype_hap: Haplotype format genotype data.

    Rows are markers and columns are individuals; each individual has two columns.

    0  0  0  1  1  0  0  0
    0  0  1  0  0  1  0  0
    1  1  0  0  0  0  1  1
    0  0  1  1  1  1  0  0
    0  0  0  1  1  1  0  0
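
Because each individual occupies two adjacent columns, a per-individual allele count can be obtained by adding the two columns of each pair. The sketch below does this in base R for the rows shown above, assuming every digit is a single 0/1 allele code; the object names are hypothetical.

```r
# Haplotype rows copied from the example above: 5 markers x 8 columns
# (assumed to be 4 individuals with two haplotype columns each).
hap <- matrix(
  c(0, 0, 0, 1, 1, 0, 0, 0,
    0, 0, 1, 0, 0, 1, 0, 0,
    1, 1, 0, 0, 0, 0, 1, 1,
    0, 0, 1, 1, 1, 1, 0, 0,
    0, 0, 0, 1, 1, 1, 0, 0),
  nrow = 5, byrow = TRUE
)

# Odd columns hold the first haplotype, even columns the second one.
first_hap  <- hap[, seq(1, ncol(hap), by = 2)]
second_hap <- hap[, seq(2, ncol(hap), by = 2)]

# Markers in rows, individuals in columns.
allele_count <- first_hap + second_hap
print(allele_count)
```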
  • haplotype_sample: Haplotype format sample data (individual names)

    3098
    3498
    3297
    2452
    4255
    2946
  • haplotype_map: Haplotype format map data (chromosome, SNP name, physical position and the two alleles)

    1  SNP1  224488  C  A
    1  SNP2  293696  G  T
    1  SNP3  333333  T  G
    1  SNP4  464830  A  G
    1  SNP5  722623  C  T
    1  SNP6  838596  C  A
  • vcf : VCF format genotype data

##fileformat=VCFv4.2
##source="beagle.29May21.d6d.jar"
##INFO=<ID=AF,Number=A,Type=Float>
##INFO=<ID=IMP,Number=0,Type=Flag>
##FORMAT=<ID=GT,Number=1,Type=String>
#CHROM  POS    ID   REF  ALT  QUAL  FILTER  INFO  FORMAT  3498  3297
1       6260   M2   T    A    .     PASS    .     GT      1|0   0|1
1       15289  M17  A    T    .     PASS    .     GT      0|0   0|0
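
In the VCF example above the GT fields are phased, so the two alleles of each individual are separated by "|". As a small illustration (not the package's own parser), the sketch below splits the GT values of marker M2 into one column per haplotype in base R; the column names `hap1` and `hap2` are made up.

```r
# Phased GT fields of marker M2 for the two example individuals.
gt <- c("3498" = "1|0", "3297" = "0|1")

# Split each field at the phase separator; fixed = TRUE treats "|" literally.
alleles <- do.call(rbind, strsplit(gt, "|", fixed = TRUE))
storage.mode(alleles) <- "integer"
colnames(alleles) <- c("hap1", "hap2")

print(alleles)
```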

Parameter

🤡Basic

  • 1: input_data_plink_ped

User-provided Plink-ped format genotype data, data.frame or matrix class.

  • 2: input_data_plink_map

User-provided Plink-map format genotype data, data.frame or matrix class.

  • 3: input_data_hmp

User-provided Hapmap format genotype data, data.frame or matrix class.

  • 4: input_data_BLUPF90

User-provided BLUPF90 format genotype data, data.frame or matrix class.

  • 5: input_data_numeric

User-provided Numeric format genotype data, data.frame or matrix class.

  • 6: input_data_haplotype_hap

User-provided Haplotype format hap data, data.frame or matrix class.

  • 7: input_data_haplotype_sample

User-provided Haplotype format sample data, data.frame or matrix class.

  • 8: input_data_haplotype_map

User-provided Haplotype format map data, data.frame or matrix class.

  • 9: input_data_vcf

User-provided VCF format genotype data, data.frame or matrix class.

Note: input_data_numeric should contain both rownames and colnames.
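
As an illustration of this requirement, the sketch below builds a small numeric genotype matrix in base R and checks that both kinds of names are present before it would be passed as input_data_numeric. The helper `has_dimnames` is hypothetical and not part of blupADC; the values are taken from the numeric example in the Output section.

```r
# Hypothetical helper (not part of blupADC): TRUE only if the matrix
# carries both row names and column names.
has_dimnames <- function(x) {
  !is.null(rownames(x)) && !is.null(colnames(x))
}

# A bare matrix has neither row names nor column names.
raw_geno <- matrix(c(2, 0, 0,
                     1, 1, 2,
                     1, 1, 2), nrow = 3, byrow = TRUE)
has_dimnames(raw_geno)   # FALSE

# Individuals go in the row names, SNP names in the column names.
rownames(raw_geno) <- c("3098", "3498", "3297")
colnames(raw_geno) <- c("SNP1", "SNP2", "SNP3")
has_dimnames(raw_geno)   # TRUE
```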

In addition, for convenience, users can provide the file name, file path, and file type of the genotype data directly, without first reading the data into the R environment.

  • 10: input_data_type

    File type of the provided genotype data, character class.

    • Hapmap
    • Plink
    • BLUPF90
    • Numeric
    • Haplotype
    • VCF
  • 11: input_data_path

File path of the provided genotype data, character class.

  • 12: input_data_name

File name of the provided genotype data, character class.

Note: if input_data_type is Plink or VCF, users do not need to include the suffix in the file name of the genotype data.

e.g. for Plink type data with files named test1.map and test1.ped, we should set input_data_name="test1".
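
A hedged sketch of the Plink case described above: the files test1.ped and test1.map are assumed to sit in the current working directory (a hypothetical location), and only parameters documented on this page are used. The call is skipped when blupADC is not installed or the files are missing, so the snippet is safe to run as is.

```r
# Shared file prefix of the Plink files, without the .ped/.map suffix.
plink_prefix <- "test1"
plink_dir    <- getwd()   # assumption: the files are in the working directory

# The two files that the prefix refers to.
plink_files <- file.path(plink_dir, paste0(plink_prefix, c(".ped", ".map")))

if (requireNamespace("blupADC", quietly = TRUE) && all(file.exists(plink_files))) {
  plink_result <- blupADC::geno_format(
    input_data_path  = plink_dir,
    input_data_name  = plink_prefix,   # no suffix for Plink data
    input_data_type  = "Plink",
    output_data_type = c("Numeric"),
    return_result    = TRUE
  )
}
```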

  • 13: output_data_name

File name of the output genotype data, character class.

  • 14: output_data_type

    File type of the output genotype data, character class.

    • Hapmap
    • Plink
    • BLUPF90
    • Numeric
    • Haplotype
    • VCF

Note: users can output multiple formats of genotype data simultaneously, e.g. output_data_type=c("Hapmap","Plink","BLUPF90","Numeric") outputs 4 types of genotype data at once.

  • 15: return_result

Whether to return the result, logical class. Default is FALSE.

Additionally, for convenience, users can save the output genotype data to the local computer.

  • 16: bigmemory_cal

Whether to use the bigmemory method for the calculation, logical class. Default is FALSE.

  • 17: bigmemory_data_path

File path of the bigmemory data, character class.

  • 18: bigmemory_data_name

File name of the bigmemory data, character class.

  • 19: phased_genotype

Whether the genotype data has been phased, logical class. Default is FALSE.

  • 20: haplotype_window_nSNP

Define haplotype blocks by the number of consecutive SNPs, numeric class. Default is NULL.

  • 21: haplotype_window_kb

Define haplotype blocks by physical position, numeric class. Default is NULL.

  • 22: haplotype_window_block

Define haplotype blocks from user-provided blocks, numeric class. Default is NULL.

The first column is the position of the window start, and the second column is the position of the window end.

1   5
6   10
11  15
16  20
21  25
26  30
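
A block table like the one above can be generated rather than typed by hand. The sketch below builds consecutive windows of 5 positions covering 30 positions in base R; the variable names are hypothetical, and whether geno_format expects a matrix or a data.frame for haplotype_window_block is an assumption to check against the package help.

```r
window_size <- 5    # number of positions per block
n_positions <- 30   # total number of positions to cover

# Start of each window, and its end capped at the last position.
block_start <- seq(1, n_positions, by = window_size)
block_end   <- pmin(block_start + window_size - 1, n_positions)

# Two-column table: window start, window end.
haplotype_block <- cbind(start = block_start, end = block_end)
print(haplotype_block)
```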

💨Advanced

  • 23: cpu_cores

Number of CPU cores used in the calculation, numeric class. Default is 1.

  • 24: miss_base

Character used for missing genotypes, character class. Default is "NN".

  • 25: miss_base_num

Number used for missing genotypes after numeric conversion, numeric class. Default is 5.
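
To make the roles of miss_base and miss_base_num concrete, the sketch below recodes a few two-character genotype calls to numbers in base R and replaces the missing call "NN" with 5. It illustrates the idea only: the counted allele ("C") and the example calls are assumptions, and this is not the package's internal code.

```r
miss_base     <- "NN"   # missing genotype character (documented default)
miss_base_num <- 5      # number used after numeric conversion (documented default)

calls          <- c("CC", "AC", "NN", "AA")   # hypothetical genotype calls
counted_allele <- "C"                         # assumption: the allele being counted

# Count copies of the chosen allele in each two-character call.
dosage <- (substr(calls, 1, 1) == counted_allele) +
          (substr(calls, 2, 2) == counted_allele)

# Replace missing calls with the missing-value code.
dosage[calls == miss_base] <- miss_base_num
print(dosage)
```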

Quanshun Mei