InsideDarkWeb.com

Separating peaks of chip-seq with specific length

My data contain several chip-seq results. I have the peaks called by MACS2.I wanted to only look at those peaks that their size is e.g 500bp to 1000bp. How can I separate those peaks efficiently?

I know one way is to take the peaks.xls file sort them based on the length column and take out the peaks less than 1kb. But I was looking for a more efficient way to do that. Any suggestion?

Many Thanks

Bioinformatics Asked on November 15, 2021

2 Answers

2 Answers

You can use rtracklayer from R bioconductor to import it, filter and do much more stuff, i use the code from this blog for importing the narrow bed file. Below I use an example file since you did not provide yours:

library(rtracklayer)
fl = "zas1_default_peaks.narrowPeak"

extraCols_narrowPeak <- c(signalValue = "numeric", pValue = "numeric",
                          qValue = "numeric", peak = "integer")
gr <- import(fl, format = "BED",
                        extraCols = extraCols_narrowPeak)

head(gr)

GRanges object with 6 ranges and 6 metadata columns:
      seqnames        ranges strand |                name     score signalValue
         <Rle>     <IRanges>  <Rle> |         <character> <numeric>   <numeric>
  [1] AB325691     8982-9219      * | zas1_default_peak_1        18     1.31697
  [2] AB325691   16685-18050      * | zas1_default_peak_2        59     1.51283
  [3]        I 101971-102188      * | zas1_default_peak_3        17     1.23717
  [4]        I 102390-103538      * | zas1_default_peak_4        19     1.24967
  [5]        I 152286-152496      * | zas1_default_peak_5        23     1.37631
  [6]        I 175025-175577      * | zas1_default_peak_6        69     1.62063
         pValue    qValue      peak
      <numeric> <numeric> <integer>
  [1]   3.37358   1.89101       146
  [2]   8.12841   5.92232       419
  [3]   3.26012   1.79545        64
  [4]   3.47875   1.97673       708
  [5]   3.89921   2.32948       136
  [6]   9.34429    6.9608       374

The narrowBed and peaks.xls contains the same information. If you really really need to read in the xls, you can do:

library(GenomicRanges)
peaks = read.table("zas1_default_peaks.xls",comment="#",header=TRUE)
gr = makeGRangesFromDataFrame(peaks,keep.extra.columns=TRUE)

Once you have the genomic ranges object, filtering is the same:

 gr[width(gr)>500 & width(gr)<1000]
GRanges object with 208 ranges and 7 metadata columns:
        seqnames          ranges strand |    length abs_summit    pileup
           <Rle>       <IRanges>  <Rle> | <integer>  <integer> <numeric>
    [1]        I   175025-175577      * |       553     175399    186.99
    [2]        I   186807-187489      * |       683     187009    208.66
    [3]        I   339070-339700      * |       631     339258    222.91
    [4]        I   431557-432088      * |       532     431734    195.66

Answered by StupidWolf on November 15, 2021

You can do it by awk.

awk -v OFS="t" '{ if ( ($3-$2) >= 500 && ($3-$2) <= 1000) { print } }' in.bed > out.bed

Assuming there is no header.

Answered by geek_y on November 15, 2021

Add your own answers!

Related Questions

Ask a Question

Get help from others!

© 2021 InsideDarkWeb.com. All rights reserved.