tabs - Getting additional NA column in read.delim in R


I am reading a text file in R that has the following form (as shown from the terminal):

    hmi$ head -2 output_perl_hmi.txt
    1   cg10619-rb  tup 18864094    18864523    rev gfp_rnai3_r1    0.870707220482784
    1   cg11050-rc  cg11050 6613278 6612484 rev gfp_rnai3_r1    0.999267733859066

But when I read it into R using read.delim, it adds an additional column of NA at the end. I can remove the column, but I am wondering why it is creating the additional column and how I can avoid it when reading the file.

    > d=read.delim("output_perl_hmi.txt", header=FALSE)
    > colnames(d) <-c("count", "flybasename", "genename", "start", "end", "type","sample", "posterior_probability")
    > head(d)
      count flybasename genename    start      end type       sample posterior_probability NA
    1     1  cg10619-rb      tup 18864094 18864523  rev gfp_rnai3_r1             0.8707072 NA
    2     1  cg11050-rc  cg11050  6613278  6612484  rev gfp_rnai3_r1             0.9992677 NA

Firstly, I must infer that your input file is delimited by tabs, though you haven't specified this, because read.delim() defaults to sep='\t'.

Secondly, I suspect the reason you're getting a column of NA at the end of your data is that you have one trailing tab at the end of every line of your input file. This results in read.delim() considering there to be one more column after the trailing tab, which it parses as NA, because there's nothing there.

Below I demonstrate this. I have created two files, file1.txt and file2.txt. The former contains the exact input you pasted in your question, under the assumptions that (1) it uses tab delimiters and (2) it has one trailing tab on each line. The latter is the same, but without the trailing tab.

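To make the whitespace visible, I pass -vet in the cat calls, which shows tabs as ^I and ends of lines as $. In general this is not sufficient to disambiguate the data, but since we know the input file contains no circumflex or dollar characters, it is unambiguous in this case.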

    system('cat -vet file1.txt;');
    ## 1^Icg10619-rb^Itup^I18864094^I18864523^Irev^Igfp_rnai3_r1^I0.870707220482784^I$
    ## 1^Icg11050-rc^Icg11050^I6613278^I6612484^Irev^Igfp_rnai3_r1^I0.999267733859066^I$
    d <- read.delim('file1.txt', header=FALSE); d;
    ##   V1         V2      V3       V4       V5  V6           V7        V8 V9
    ## 1  1 cg10619-rb     tup 18864094 18864523 rev gfp_rnai3_r1 0.8707072 NA
    ## 2  1 cg11050-rc cg11050  6613278  6612484 rev gfp_rnai3_r1 0.9992677 NA
    system('cat -vet file2.txt;');
    ## 1^Icg10619-rb^Itup^I18864094^I18864523^Irev^Igfp_rnai3_r1^I0.870707220482784$
    ## 1^Icg11050-rc^Icg11050^I6613278^I6612484^Irev^Igfp_rnai3_r1^I0.999267733859066$
    d <- read.delim('file2.txt', header=FALSE); d;
    ##   V1         V2      V3       V4       V5  V6           V7        V8
    ## 1  1 cg10619-rb     tup 18864094 18864523 rev gfp_rnai3_r1 0.8707072
    ## 2  1 cg11050-rc cg11050  6613278  6612484 rev gfp_rnai3_r1 0.9992677
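The same check can be made from within R, which may help where cat -vet is not available. The following is a minimal sketch, assuming a stand-in file written to a temporary path (the path and its contents are my reconstruction of the input, not the real file). count.fields() reports how many fields each line is parsed into, and grepl() flags lines whose last character is a tab.

```r
# Stand-in for the input file (an assumption): tab-delimited, one trailing tab per line.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "1\tcg10619-rb\ttup\t18864094\t18864523\trev\tgfp_rnai3_r1\t0.870707220482784\t",
  "1\tcg11050-rc\tcg11050\t6613278\t6612484\trev\tgfp_rnai3_r1\t0.999267733859066\t"
), path)

# Number of tab-separated fields on each line; a trailing tab adds one empty field.
n_fields <- count.fields(path, sep = "\t")
print(n_fields)

# TRUE for every line whose last character is a tab.
has_trailing_tab <- grepl("\t$", readLines(path))
print(has_trailing_tab)
```

If n_fields is one more than the number of columns you expect and has_trailing_tab is TRUE throughout, a trailing tab is the likely cause.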

A solution is therefore to strip the trailing whitespace from your input file prior to reading it into R. (Note: I looked into using the strip.white, colClasses, and col.names arguments of read.table() (which is called by read.delim(), relaying ... to it) to solve this problem by automatically stripping whitespace or ignoring columns, but nothing I tried worked.)
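The stripping can also be done inside R rather than on the file itself. This is only a sketch under the same assumptions as above (tab delimiters, one or more trailing tabs per line); the temporary file stands in for the real input, and the column names are the ones from the question. It reads the raw lines, removes any run of tabs at the end of each line, and hands the cleaned lines to read.delim() through its text argument.

```r
# Stand-in for the input file (an assumption): tab-delimited, one trailing tab per line.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "1\tcg10619-rb\ttup\t18864094\t18864523\trev\tgfp_rnai3_r1\t0.870707220482784\t",
  "1\tcg11050-rc\tcg11050\t6613278\t6612484\trev\tgfp_rnai3_r1\t0.999267733859066\t"
), path)

# Read the raw lines and drop any run of tabs at the end of each line.
lines <- readLines(path)
lines <- sub("\t+$", "", lines)

# Parse the cleaned lines; text= makes read.delim() read from a character vector.
d <- read.delim(text = lines, header = FALSE)
colnames(d) <- c("count", "flybasename", "genename", "start", "end",
                 "type", "sample", "posterior_probability")
print(ncol(d))
```

One caveat with this approach: if the last real column can legitimately be empty on some lines, stripping trailing tabs would remove that empty field too, so it is only appropriate when the trailing tab is known to be spurious.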

Also, for general interest and knowledge, if you have multiple trailing tabs, each one is taken by read.delim() as a separator, and you will receive a corresponding column in the returned data.frame for each such tab:

    system('cat -vet file3.txt;');
    ## 1^Icg10619-rb^Itup^I18864094^I18864523^Irev^Igfp_rnai3_r1^I0.870707220482784^I^I$
    ## 1^Icg11050-rc^Icg11050^I6613278^I6612484^Irev^Igfp_rnai3_r1^I0.999267733859066^I^I$
    d <- read.delim('file3.txt', header=FALSE); d;
    ##   V1         V2      V3       V4       V5  V6           V7        V8 V9 V10
    ## 1  1 cg10619-rb     tup 18864094 18864523 rev gfp_rnai3_r1 0.8707072 NA  NA
    ## 2  1 cg11050-rc cg11050  6613278  6612484 rev gfp_rnai3_r1 0.9992677 NA  NA

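And to be complete here, I tested read.delim() to see what happens if the input lines contain inconsistent numbers of delimiters. It appears to respect the "widest" input line, meaning the returned data.frame will contain as many columns as necessary to cover the most-delimited line in the input file. Short lines will have NA in the rightmost cells not covered by that line. (Per the read.table() documentation, the number of columns is determined from the first five lines of input, so this holds when the widest line is among them.)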
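A small sketch of that behaviour, using made-up numeric input passed through the text argument (the three lines below are hypothetical and not part of the original file). The second line has three fields, the first has two and the third has one. Numeric fields are used on purpose: blank fields in text columns may come back as empty strings rather than NA.

```r
# Hypothetical ragged input: lines with 2, 3 and 1 tab-separated numeric fields.
ragged <- c("1\t2", "3\t4\t5", "6")

# read.delim() pads the short lines out to the width of the widest line.
d <- read.delim(text = ragged, header = FALSE)
print(dim(d))
print(d)

# Which cells in the last column were padded rather than read from the input.
print(is.na(d$V3))
```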

