removing cells in minor compaction

removing cells in minor compaction

Dave Latham
What cells, if any, are removed during minor compactions?

Cells that
(a) are beyond the TTL?
(b) are shadowed by a delete marker? (from the files compacted)
(c) are shadowed by newer versions? (assuming numVersions configured < num
versions of the cell found)

Re: removing cells in minor compaction

Dave Latham
And for any of the cases - if not, then why not?  Because that hasn't been
implemented, or there's an actual reason that HBase would not want to do it?
With reads for a custom time range, it's possible to still read data that
is waiting to be GCed from one of the above mechanisms and will disappear
after that happens.  Doing the GC during minor compactions as well as major
ones would change that visibility window, but doesn't seem to change that
odd behavior that is there to begin with.


On Wed, Jun 14, 2017 at 5:51 PM, Dave Latham <[hidden email]> wrote:

> What cells, if any, are removed during minor compactions?
>
> Cells that
> (a) are beyond the TTL?
> (b) are shadowed by a delete marker? (from the files compacted)
> (c) are shadowed by newer versions? (assuming numVersions configured < num
> versions of the cell found)
>

Re: removing cells in minor compaction

stack-3
In reply to this post by Dave Latham
On Wed, Jun 14, 2017 at 5:51 PM, Dave Latham <[hidden email]> wrote:

> What cells, if any, are removed during minor compactions?
>
> Cells that
> (a) are beyond the TTL?
> (b) are shadowed by a delete marker? (from the files compacted)
> (c) are shadowed by newer versions? (assuming numVersions configured < num
> versions of the cell found)
>


When compacting, we use scanners to read the hfiles. The core difference
between major and minor compaction is the scanType. If major (i.e. all files
in the Store are in the compaction set), then ScanType.COMPACT_DROP_DELETES,
else ScanType.COMPACT_RETAIN_DELETES.
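
The choice described above can be sketched roughly as follows. This is a self-contained toy, not HBase's code: the ScanTypeSketch class and the chooseScanType helper are made up for illustration, and only the two constant names are taken from the paragraph above.

```java
import java.util.Set;

final class ScanTypeSketch {
    // Hypothetical stand-in that only mirrors the two constant names
    // mentioned above; it is not the HBase ScanType class.
    enum ScanType { COMPACT_DROP_DELETES, COMPACT_RETAIN_DELETES }

    // Treat the compaction as major when every file in the store
    // is part of the compaction set, otherwise as minor.
    static ScanType chooseScanType(Set<String> storeFiles, Set<String> selected) {
        boolean allFilesSelected = selected.containsAll(storeFiles);
        return allFilesSelected
                ? ScanType.COMPACT_DROP_DELETES
                : ScanType.COMPACT_RETAIN_DELETES;
    }

    public static void main(String[] args) {
        Set<String> store = Set.of("f1", "f2", "f3");
        // Only some of the files: retain delete markers.
        System.out.println(chooseScanType(store, Set.of("f1", "f2")));
        // Every file in the store: delete markers may be dropped.
        System.out.println(chooseScanType(store, Set.of("f1", "f2", "f3")));
    }
}
```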

What to retain and what to drop is decided by the same rules that govern a
Scan, in ScanQueryMatcher (actually, compactions use
CompactionScanQueryMatcher, a subclass whose only purpose is enforcing the
scanType delete policy).

To answer your questions Dave:

a.) Yes (A Scan does not let you see Cells that are beyond TTL so on
compaction, they are not 'seen' and so not written out to the new compacted
file).
b.) No (See logic in CompactionScanQueryMatcher)
c.) Yes
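
To make the three answers concrete, here is a toy model of a minor compaction pass. It is only a sketch of the answers as stated above, not something checked against the HBase source: the Cell class, the compact method and its parameters are all hypothetical, and puts shadowed by a delete marker are simply left alone, following answer (b).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MinorCompactionSketch {
    // Hypothetical, simplified cell: a column name, a timestamp, a delete flag.
    static final class Cell {
        final String column;
        final long timestamp;
        final boolean isDeleteMarker;

        Cell(String column, long timestamp, boolean isDeleteMarker) {
            this.column = column;
            this.timestamp = timestamp;
            this.isDeleteMarker = isDeleteMarker;
        }
    }

    // Input is assumed to be sorted newest-first within each column.
    static List<Cell> compact(List<Cell> cells, long now, long ttlMillis, int maxVersions) {
        List<Cell> out = new ArrayList<>();
        Map<String, Integer> versionsSeen = new HashMap<>();
        for (Cell c : cells) {
            if (c.isDeleteMarker) {
                out.add(c);              // (b) markers are retained in a minor compaction
                continue;
            }
            if (now - c.timestamp > ttlMillis) {
                continue;                // (a) past the TTL: never seen, so never rewritten
            }
            int seen = versionsSeen.merge(c.column, 1, Integer::sum);
            if (seen > maxVersions) {
                continue;                // (c) beyond the configured max versions
            }
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> input = List.of(
                new Cell("q", 950, true),    // delete marker
                new Cell("q", 940, false),
                new Cell("q", 930, false),
                new Cell("q", 920, false),   // third live version
                new Cell("q", 100, false));  // older than the TTL
        List<Cell> kept = compact(input, 1000, 500, 2);
        for (Cell c : kept) {
            System.out.println(c.column + "@" + c.timestamp + (c.isDeleteMarker ? " (delete)" : ""));
        }
    }
}
```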

St.Ack

Re: removing cells in minor compaction

stack-3
In reply to this post by Dave Latham
(Disclaimer: my previous message did not involve verification in code or
turning up test cases to prove my assertions. For example, the
documentation claims that we retain versions beyond the max configured when
we do minor compactions but I do not see in code how that is done. Perhaps
this is how it used to be. Need to dig more).

On Mon, Jun 19, 2017 at 8:27 AM, Dave Latham <[hidden email]> wrote:

> And for any of the cases - if not, then why not?  Because that hasn't been
> implemented, or there's an actual reason that HBase would not want to do
> it?
>

Being able to delete in minor compaction would be an improvement; we are
reading the data anyways.

Traditionally, the spoke in the wheel is the fact that we allow edits to
come in any order -- clients can write an edit into the past or into the
future -- so we can't be sure at compaction time that we see edits in their
insert order. If sequenceid were a first-class attribute of Cells, always
present, we could rely on it to figure out order.
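
A small illustration of the ordering problem, as a self-contained toy (the Edit class and both helper methods are hypothetical, not HBase types): three edits arrive in one order, but their client-supplied timestamps sort them differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class EditOrderSketch {
    // Hypothetical edit: the timestamp is chosen by the client,
    // the sequence id reflects the order in which the edit arrived.
    static final class Edit {
        final String value;
        final long timestamp;
        final long sequenceId;

        Edit(String value, long timestamp, long sequenceId) {
            this.value = value;
            this.timestamp = timestamp;
            this.sequenceId = sequenceId;
        }
    }

    // Version order as a reader sees it: newest timestamp first.
    static List<String> newestFirstByTimestamp(List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingLong((Edit e) -> e.timestamp).reversed());
        return values(sorted);
    }

    // Arrival order, which needs the sequence id to reconstruct.
    static List<String> byInsertOrder(List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingLong((Edit e) -> e.sequenceId));
        return values(sorted);
    }

    private static List<String> values(List<Edit> edits) {
        List<String> out = new ArrayList<>();
        for (Edit e : edits) {
            out.add(e.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Edit> edits = List.of(
                new Edit("first", 300, 1),    // written first, stamped into the future
                new Edit("second", 100, 2),   // written second, stamped into the past
                new Edit("third", 200, 3));
        System.out.println(newestFirstByTimestamp(edits));
        System.out.println(byInsertOrder(edits));
    }
}
```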

Absent sequenceid, minor compactions always take adjacent (according to the
order in which they were flushed) subsets of the files in the store; given
this precept, we know we can safely remove versions if, within our subset,
we've encountered more than the configured max versions.
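
The adjacency precept can be checked with something like the following. Again this is a toy under stated assumptions: file names are plain strings, the list is already in flush order, and the AdjacencySketch class and isAdjacentRun helper are invented for the example.

```java
import java.util.List;
import java.util.Set;

final class AdjacencySketch {
    // True when the selected files occupy consecutive positions
    // in the flush-ordered list, with no unselected file in between.
    static boolean isAdjacentRun(List<String> filesInFlushOrder, Set<String> selected) {
        int first = -1;
        int last = -1;
        int count = 0;
        for (int i = 0; i < filesInFlushOrder.size(); i++) {
            if (selected.contains(filesInFlushOrder.get(i))) {
                if (first < 0) {
                    first = i;
                }
                last = i;
                count++;
            }
        }
        // Every selected file must exist in the store, and the span
        // from first to last must contain nothing but selected files.
        return count > 0 && count == selected.size() && last - first + 1 == count;
    }

    public static void main(String[] args) {
        List<String> flushOrder = List.of("f1", "f2", "f3", "f4");
        System.out.println(isAdjacentRun(flushOrder, Set.of("f2", "f3"))); // consecutive files
        System.out.println(isAdjacentRun(flushOrder, Set.of("f1", "f3"))); // f2 is skipped
    }
}
```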


> With reads for a custom time range, it's possible to still read data that
> is waiting to be GCed from one of the above mechanisms and will disappear
> after that happens.  Doing the GC during minor compactions as well as major
> ones would change that visibility window, but doesn't seem to change that
> odd behavior that is there to begin with.
>
>
Should we support retaining deletes even on major compactions for some
user-configured period?

Thanks D,
St.Ack

P.S. This section needs a tuneup:
http://hbase.apache.org/book.html#compaction


> On Wed, Jun 14, 2017 at 5:51 PM, Dave Latham <[hidden email]> wrote:
>
> > What cells, if any, are removed during minor compactions?
> >
> > Cells that
> > (a) are beyond the TTL?
> > (b) are shadowed by a delete marker? (from the files compacted)
> > (c) are shadowed by newer versions? (assuming numVersions configured < num
> > versions of the cell found)
> >
>