Parallelism in R

  • Package : {parallel}

Background


  • The {parallel} package in R supports coarse-grained parallelization, in which a task is split into large, independent chunks that run on separate processes.

  • This model is implemented by functions such as mclapply and parLapply, which are near drop-in replacements for lapply.

  • There are parallel replacements for many serial commands, for example parLapply for lapply, parRapply for apply(,1,), parCapply for apply(,2,), etc.
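As a minimal sketch of these replacements, the serial and parallel calls can be compared side by side. The matrix m and the helper square are made up for illustration and are not part of the original data.

```r
library(parallel)

# Hypothetical toy inputs, used only for this illustration
m <- matrix(1:6, nrow = 2)           # 2 x 3 matrix
square <- function(x) x^2

cl <- makeCluster(2)                 # two local worker processes

# lapply vs. parLapply: both return a list
serialList   <- lapply(1:4, square)
parallelList <- parLapply(cl, 1:4, square)

# apply(, 1, ) vs. parRapply: one call per row
serialRows   <- apply(m, 1, sum)
parallelRows <- parRapply(cl, m, sum)

# apply(, 2, ) vs. parCapply: one call per column
serialCols   <- apply(m, 2, sum)
parallelCols <- parCapply(cl, m, sum)

stopCluster(cl)                      # always release the workers
```

One difference to keep in mind: parRapply and parCapply always return a flat vector, so when the function returns more than one value per row or column, the result has to be reshaped by hand.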

Global variable: rnames


  • The global variable rnames is the reference table: each worker in the parallel cluster reads rows from it to fill in the results.
row.names             Proteins                  Sequence.window
        1               A0FGR8  LLASPGHISVKEPTPSIASDISLPIATQELR
        2               A1L390  SKSSGFGSPRLVSRSSSVLSLEGSEKGLARH
        3               O00505  RDEHLLKKRNVPQEESLEDSDVDADFKAQNV
        4               O00566  SPVFSDEDSDLDFDISKLEQQSKVQNKGQGK
        5               O00567  KPKKKKSFSKEELMSSDLEETAGSTSIPKRK
        6               O14646  KNGKKILGQKKRQIDSSEEDDDEEDYDNDKR
        7               O14646  NGKKILGQKKRQIDSSEEDDDEEDYDNDKRS
        8               O14745  LAEAALESPRPALVRSASSDTSEELNSQDSP
        9               O14745  ESPRPALVRSASSDTSEELNSQDSPPKQDST
       10 P19105;P24844;O14950  KRTKTKTKKRPQRATSNVFAMFDQSQIQEFK
       11               O14974  SKEGQKEKDTAGVTRSASSPRLSSSLDNKEK
       12               O14974  SDENEQEQQSDTEEGSNKKETQTDSISRYET
       13               O15047  PSEISEASEEKRPRPSTPAEEDEDDPEQEKE
       14               O43318  IQDLTVTGTEPGQVSSRSSSPSVRMITTSGP
       15               O43379  LPSVMAGVPARRGQSSPPPAPPICLRRRTRL
       16               O43719  EDDDSNEKLFDEEEDSSEKLFDDSDERGTLG
       17               O43719  KKAEEGDADEKLFEESDDKEDEDADGKEVED
       18               O60231  EKNRSYRLLEDSEESSEETVSRAGSSLQKKR
       19               O60271  AGLDTEGSKQRSASQSSLDKLDQELKEQQKE
       20               O60293  RKWKPKFWRKPISDNSFSSDEEQSTGPIKYA

Result storage


  • The variable rowExtract stores the results returned by the workers of the parallel cluster. Its original content is shown below.
row.names     V1     V2     V3
        1     X1     X1     X1
        2     X3     X3     X3
        3     X5     X5     X5
        4     X8     X8     X8
        5     X9     X9     X9
        6    X10    X10    X10
        7    X11    X11    X11
        8    X13    X13    X13
        9    X15    X15    X15
       10    X16    X16    X16
       11    X17    X17    X17
       12    X18    X18    X18
       13    X19    X19    X19
       14    X20    X20    X20

Core code using {parallel}


  • The function getSepRowInfo takes one row of rowExtract and extracts the number from its first entry, e.g. 1, 3, 5. That number is matched against the row.names of rnames to fetch the corresponding protein identifier and sequence window. Finally, a matrix transpose puts the combined result back into the row-by-column layout.
# fill in one row of rowExtract; relies on the global data frame rnames
getSepRowInfo <- function(data) {
    # strip the leading "X" from the key, e.g. "X3" -> 3
    numberInData <- as.numeric(substr(data[1], 2, nchar(data[1])))
    # look up the matching row of rnames; as.character() guards against factor columns
    data[2] <- as.character(rnames[numberInData, 1])
    data[3] <- as.character(rnames[numberInData, 2])
    return(data)
}

# mclapply relies on forking, which is not available on Windows (there it only runs with mc.cores = 1)
#mclapply(get_fill_info,getSepRowInfo,
#    mc.silent = FALSE,
#    mc.cores = getOption("mc.cores", 1),
#    mc.cleanup = TRUE, mc.allow.recursive = TRUE)

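# another method for parallel computing: a socket cluster, which also works on Windows
library(parallel)
# create a cluster with two worker processes (or the number set in option "cl.cores")
cl <- makeCluster(getOption("cl.cores", 2))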

# export the global variable rnames to every worker; this step is required
# because each worker starts without the variables of the main session
clusterExport(cl=cl, varlist=c("rnames"))

# run getSepRowInfo in parallel, one call per row of the matrix
rowExtract <- parRapply(cl,rowExtract,getSepRowInfo)

# parRapply returns a flat vector (three values per row, concatenated),
# so rebuild a matrix with three rows and transpose it
rowExtract <- t(matrix(rowExtract,nrow=3))

# stop the cluster created
stopCluster(cl)
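The steps above can be tried end to end with a small, self-contained sketch. It assumes rnames is a data frame holding only the first three rows of the table above and that rowExtract is a character matrix cut down to its first two rows. The tryCatch wrapper and the envir argument are additions for robustness and are not part of the original code.

```r
library(parallel)

# First three rows of rnames, copied from the table above
rnames <- data.frame(
    Proteins = c("A0FGR8", "A1L390", "O00505"),
    Sequence.window = c("LLASPGHISVKEPTPSIASDISLPIATQELR",
                        "SKSSGFGSPRLVSRSSSVLSLEGSEKGLARH",
                        "RDEHLLKKRNVPQEESLEDSDVDADFKAQNV"),
    stringsAsFactors = FALSE
)

# Shortened rowExtract: rows (X1, X1, X1) and (X3, X3, X3)
rowExtract <- matrix(rep(c("X1", "X3"), times = 3), ncol = 3)

getSepRowInfo <- function(data) {
    # "X3" -> 3, used as the row index into rnames
    numberInData <- as.numeric(substr(data[1], 2, nchar(data[1])))
    data[2] <- rnames[numberInData, 1]
    data[3] <- rnames[numberInData, 2]
    data
}

cl <- makeCluster(2)
result <- tryCatch({
    # copy rnames from the current environment to every worker
    clusterExport(cl, varlist = "rnames", envir = environment())
    flat <- parRapply(cl, rowExtract, getSepRowInfo)
    # three values per input row, so fill the matrix row by row
    matrix(flat, ncol = 3, byrow = TRUE)
}, finally = stopCluster(cl))        # stop the workers even if an error occurs

print(result)
```

Here matrix(flat, ncol = 3, byrow = TRUE) produces the same layout as t(matrix(flat, nrow = 3)); either form works.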
