
The SAM Format Explained

Introduction

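SAM (Sequence Alignment/Map) is a standard format for sequence alignments, developed at the Sanger Institute. It is a TAB-delimited text format. It is mainly used to represent the result of mapping sequencing reads to a genome, but it can also represent arbitrary multiple alignments.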

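A SAM file has two parts: the header section and the alignment section. The header is optional. Every header line starts with @, followed by a two-letter record type that says what the line describes. The main record types are @HD (the version of the standard the file conforms to and the sort order of the alignments), @SQ (a reference sequence), @RG (a read group), @PG (a program that was used), and @CO (a free-text comment).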
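As a rough illustration of the header layout, the sketch below splits header lines into a record type and its TAG:VALUE fields. The function name `parse_header_line` and the example header values are made up for this post; only the record types and the tags shown (VN, SO, SN, LN, ID, SM, PN) come from the SAM format itself.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, fields)."""
    parts = line.rstrip("\n").split("\t")
    record_type = parts[0]  # e.g. "@HD", "@SQ"
    if not record_type.startswith("@"):
        raise ValueError("header lines must start with '@'")
    if record_type == "@CO":
        # Comment lines carry free text rather than TAG:VALUE pairs.
        return record_type, "\t".join(parts[1:])
    fields = {}
    for item in parts[1:]:
        tag, _, value = item.partition(":")
        fields[tag] = value
    return record_type, fields


# Hypothetical header, for illustration only.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:sample1\tSM:sample1",
    "@PG\tID:bwa\tPN:bwa",
    "@CO\tfree-text comment",
]
parsed_header = [parse_header_line(line) for line in example_header]
```

Each entry of `parsed_header` is a pair such as `("@HD", {"VN": "1.6", "SO": "coordinate"})`; the @CO line keeps its text as a plain string.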

Alignment section

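In the alignment section, each line describes the alignment of one segment. A line consists of 11 mandatory fields followed by optional fields, all separated by TABs. The 11 mandatory fields always appear in a fixed order; when a value is unavailable, the field is set to '0' or '*', depending on its definition. The fields are: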

Col  Field  Description
1    QNAME  Query template/pair NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition/coordinate of clipped sequence
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  extended CIGAR string
7    MRNM   Mate Reference sequence NaMe ('=' if same as RNAME); named RNEXT in the current specification
8    MPOS   1-based Mate POSition; named PNEXT in the current specification
9    TLEN   inferred Template LENgth (insert size)
10   SEQ    query SEQuence on the same strand as the reference
11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
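The field layout above can be sketched in a few lines of Python. This is a minimal illustration, not a validating parser: `SamRecord`, `parse_alignment_line`, `phred_scores`, and the example read are all hypothetical names and data made up here. The `rnext` and `pnext` attributes hold fields 7 and 8 (the mate's reference name and position).

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual optional",
)


def parse_alignment_line(line):
    """Parse one alignment line into its 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 TAB-separated fields")
    optional = {}
    for item in cols[11:]:
        # Optional fields look like TAG:VTYPE:VALUE; VALUE may itself contain ':'.
        tag, vtype, value = item.split(":", 2)
        optional[tag] = (vtype, value)
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], rnext=cols[6], pnext=int(cols[7]),
        tlen=int(cols[8]), seq=cols[9], qual=cols[10], optional=optional,
    )


def phred_scores(qual):
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    return [] if qual == "*" else [ord(ch) - 33 for ch in qual]


# Made-up example read, for illustration only.
example_line = "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0"
record = parse_alignment_line(example_line)
```

Optional values are kept as strings together with their type code, so the caller decides how to convert them (for example `int` for type `i`).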

The FLAG field

Each bit in the FLAG field is defined as:

Flag    Chr  Description
0x0001  p    the read is paired in sequencing
0x0002  P    the read is mapped in a proper pair
0x0004  u    the query sequence itself is unmapped
0x0008  U    the mate is unmapped
0x0010  r    strand of the query (1 for reverse)
0x0020  R    strand of the mate
0x0040  1    the read is the first read in a pair
0x0080  2    the read is the second read in a pair
0x0100  s    the alignment is not primary
0x0200  f    the read fails platform/vendor quality checks
0x0400  d    the read is either a PCR or an optical duplicate
0x0800  S    the alignment is supplementary
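Because FLAG is a bitwise combination of the values above, decoding it is a matter of testing each bit. The sketch below shows one way to do that; the descriptive names in `FLAG_BITS` and the function `decode_flag` are labels chosen for this example, not part of the format.

```python
# (bit value, label) pairs; the labels are illustrative names only.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse"),
    (0x0020, "mate_reverse"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
    (0x0800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 99 = 0x0001 + 0x0002 + 0x0020 + 0x0040
flag_99 = decode_flag(99)
```

Here `flag_99` is `["paired", "proper_pair", "mate_reverse", "first_in_pair"]`: a properly paired first read whose mate maps to the reverse strand.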

The CIGAR format

A CIGAR string consists of a series of operation lengths, each followed by its operation. The conventional CIGAR format allows three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format adds four more operations, shown in the following table, to describe clipping, padding and splicing:

Operation  Description
M          Alignment match (can be a sequence match or mismatch)
I          Insertion to the reference
D          Deletion from the reference
N          Skipped region from the reference
S          Soft clip on the read (clipped sequence present in SEQ)
H          Hard clip on the read (clipped sequence NOT present in SEQ)
P          Padding (silent deletion from the padded reference sequence)
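To make the operations concrete, the following sketch splits a CIGAR string into (length, operation) pairs and adds up how many bases it covers on the reference and in the read. It assumes only the seven operations in the table above (the current specification also defines `=` and `X`, which this sketch rejects); the names `parse_cigar`, `reference_length`, and `query_length` are made up for this example.

```python
import re

CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHP])")
CONSUMES_REFERENCE = set("MDN")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS")      # operations whose bases appear in SEQ


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if cigar == "*":  # CIGAR unavailable
        return []
    if not re.fullmatch(r"(?:\d+[MIDNSHP])+", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP_RE.findall(cigar)]


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases expected in SEQ (hard-clipped bases excluded)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


ops = parse_cigar("3S6M1I4M2D5M")
```

For `3S6M1I4M2D5M` the alignment spans 17 reference bases (6 + 4 + 2 + 5) and the read holds 19 bases (3 + 6 + 1 + 4 + 5), so an alignment starting at POS ends at POS + 16 on the reference.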