Getting an additional NA column with read.delim in R
I am reading a text file in R of the following form (from the terminal):
    hmi$ head -2 output_perl_hmi.txt
    1  cg10619-rb  tup      18864094  18864523  rev  gfp_rnai3_r1  0.870707220482784
    1  cg11050-rc  cg11050  6613278   6612484   rev  gfp_rnai3_r1  0.999267733859066
But when I read it into R using read.delim, it adds an additional column of NA at the end. I can remove the column, but I am wondering why it is creating the additional column and how I can avoid it when reading the file.
    > d = read.delim("output_perl_hmi.txt", header=FALSE)
    > colnames(d) <- c("count", "flybasename", "genename", "start", "end", "type", "sample", "posterior_probability")
    > head(d)
      count flybasename genename    start      end type       sample posterior_probability NA
    1     1  cg10619-rb      tup 18864094 18864523  rev gfp_rnai3_r1             0.8707072 NA
    2     1  cg11050-rc  cg11050  6613278  6612484  rev gfp_rnai3_r1             0.9992677 NA
Firstly, I must infer that your input file is delimited by tabs, even though you haven't specified this, because read.delim() defaults to sep='\t'.
Secondly, I suspect the reason you are getting a column of NA at the end of your data is that you have one trailing tab at the end of every line of your input file. This results in read.delim() considering there to be one more column after the trailing tab, which it parses as NA, because there's nothing there.
Below I demonstrate this. I have created two files, file1.txt and file2.txt. The former contains the exact input file you pasted into your question, under the assumptions that (1) it uses tab delimiters and (2) it has one trailing tab on each line. The latter is the same, but without the trailing tab.
To clarify the whitespace, in the cat calls I pass -vet, which shows tabs as ^I and EOLs as $. This is not sufficient to disambiguate arbitrary data, but since we know your input file contains no circumflex or dollar characters, it is unambiguous in this case.
    system('cat -vet file1.txt;');
    ## 1^Icg10619-rb^Itup^I18864094^I18864523^Irev^Igfp_rnai3_r1^I0.870707220482784^I$
    ## 1^Icg11050-rc^Icg11050^I6613278^I6612484^Irev^Igfp_rnai3_r1^I0.999267733859066^I$
    d <- read.delim('file1.txt', header=FALSE); d;
    ##   V1         V2      V3       V4       V5  V6           V7        V8 V9
    ## 1  1 cg10619-rb     tup 18864094 18864523 rev gfp_rnai3_r1 0.8707072 NA
    ## 2  1 cg11050-rc cg11050  6613278  6612484 rev gfp_rnai3_r1 0.9992677 NA

    system('cat -vet file2.txt;');
    ## 1^Icg10619-rb^Itup^I18864094^I18864523^Irev^Igfp_rnai3_r1^I0.870707220482784$
    ## 1^Icg11050-rc^Icg11050^I6613278^I6612484^Irev^Igfp_rnai3_r1^I0.999267733859066$
    d <- read.delim('file2.txt', header=FALSE); d;
    ##   V1         V2      V3       V4       V5  V6           V7        V8
    ## 1  1 cg10619-rb     tup 18864094 18864523 rev gfp_rnai3_r1 0.8707072
    ## 2  1 cg11050-rc cg11050  6613278  6612484 rev gfp_rnai3_r1 0.9992677
A solution is therefore to strip the trailing whitespace from the input file prior to reading it into R. (Note: I looked into using the strip.white, colClasses, and col.names arguments of read.table() (which is called by read.delim(), relaying ... to it) to solve the problem by automatically stripping the whitespace or ignoring columns, but nothing I tried worked.)
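As a rough sketch of the stripping step done from within R rather than in the shell: the lines can be read with readLines(), trailing tabs and spaces removed with sub(), and the cleaned text handed to read.delim() through the text argument that it relays to read.table(). The sample rows and the temporary file below are made up for illustration and are not the original data.

```r
# Hypothetical sample file: two tab-delimited rows, each ending in a stray tab.
path <- tempfile(fileext = ".txt")
writeLines(c("1\tcg10619-rb\ttup\t0.87\t",
             "1\tcg11050-rc\tcg11050\t0.99\t"), path)

# Read the raw lines and remove any run of trailing tabs/spaces from each one.
lines <- readLines(path)
lines <- sub("[ \t]+$", "", lines)

# Parse the cleaned lines; each element of `lines` is treated as one input line.
d <- read.delim(text = lines, header = FALSE)

ncol(d)  # the trailing all-NA column is no longer created
```

One caveat with this approach: it also removes trailing tabs that mark a genuinely empty last field, so it only suits files where the final column is always populated.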
Also, for general interest and knowledge, if you have multiple trailing tabs, each is taken by read.delim() as a separator, and you will receive a corresponding column in the returned data.frame for each such tab:
    system('cat -vet file3.txt;');
    ## 1^Icg10619-rb^Itup^I18864094^I18864523^Irev^Igfp_rnai3_r1^I0.870707220482784^I^I$
    ## 1^Icg11050-rc^Icg11050^I6613278^I6612484^Irev^Igfp_rnai3_r1^I0.999267733859066^I^I$
    d <- read.delim('file3.txt', header=FALSE); d;
    ##   V1         V2      V3       V4       V5  V6           V7        V8 V9 V10
    ## 1  1 cg10619-rb     tup 18864094 18864523 rev gfp_rnai3_r1 0.8707072 NA  NA
    ## 2  1 cg11050-rc cg11050  6613278  6612484 rev gfp_rnai3_r1 0.9992677 NA  NA
And to be complete here, I tested read.delim() to see what happens if the input lines contain inconsistent numbers of delimiters. It appears to respect the "widest" input line, meaning the returned data.frame will contain as many columns as necessary to cover the most-delimited line in the input file. Short lines will have NA in the rightmost cells not covered by that line.
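A small sketch of that last test, using made-up numeric input passed through the text argument instead of a file:

```r
# Hypothetical input: three lines with 3, 2, and 4 tab-separated fields.
txt <- c("1\t2\t3",
         "4\t5",
         "6\t7\t8\t9")

# read.delim() sets fill = TRUE by default, so short lines are padded.
d <- read.delim(text = txt, header = FALSE)

ncol(d)         # one column per field of the widest line
is.na(d$V3[2])  # the short second line is padded on the right
```

Two details from the read.table() documentation are worth keeping in mind here. The number of columns is determined from the first five lines of input (or from col.names, if that is longer), so a wider line appearing later in a long file may not widen the result. And padded cells come out as NA in numeric or logical columns, whereas in character columns they are typically empty strings.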