Re: Scheme design questions


Naama Kraus
Thanks again, Naama

On Mon, May 19, 2008 at 9:32 AM, Jim Kellerman <[hidden email]> wrote:

> Comments inline.
>
> ---
> Jim Kellerman, Senior Engineer; Powerset
>
>
> > -----Original Message-----
> > From: Naama Kraus [mailto:[hidden email]]
> > Sent: Sunday, May 18, 2008 9:03 PM
> > To: [hidden email]
> > Subject: Re: Scheme design questions
> >
> > Thank you very much Jim for the useful information.
> > My further questions inlined (within <<< ... >>>).
> >
> > Another question - what are the limits on number of families
> > and number of family members within one table ?
>
> Currently, there are no limits on the number of column families you
> can create. However, Google's Bigtable paper says that you should expect
> some limit (in the hundreds at most), but neither Bigtable nor HBase
> limits the number of family members. See below for an explanation.
>
> > Are there any limits to the overall size of the data stored in a table ?
>
> There are no architectural limits to the size of a table.
>
>
> More below
>
> > Naama
> >
> > On Sun, May 18, 2008 at 8:08 PM, Jim Kellerman
> > <[hidden email]> wrote:
> >
> > > Comments in-line below
> > >
> > > ---
> > > Jim Kellerman, Senior Engineer; Powerset
> > >
> > >
> > > > -----Original Message-----
> > > > From: Naama Kraus [mailto:[hidden email]]
> > > > Sent: Sunday, May 18, 2008 4:01 AM
> > > > To: [hidden email]
> > > > Subject: Scheme design questions
> > > >
> > > > Hi,
> > > >
> > > > I am trying to figure out how I should design HBase tables, and I
> > > > have a couple of questions. I'd appreciate some assistance.
> > > >
> > > > Say I have data about students consisting of a student id and some
> > > > basic information such as first name, last name, gender, address,
> > > > date she started her studies, hobbies and some areas of interest.
> > > > Additionally, for each student there is information on the courses
> > > > she has taken and the final grade in each.
> > > >
> > > > My Questions:
> > > > 1. Should the basic attributes (first name, last name, gender
> > > > ...) share a common column family, or should each have a different
> > > > family ?
> > >
> > > This kind of depends on the access pattern. For example, in the
> > > Webtable example, one column contains page content, which is usually
> > > processed together, and another column contains page attributes such
> > > as encoding, mime-type, etc.
> > >
> > > My guess is that your information should share a column family.
> >
> >
> > <<< So does this mean that a column family is stored together
> > ? In the documentation I read that regions are stored
> > together, but I thought a region is a bunch of rows, each
> > containing all columns. So I am now confused: rows or columns
> > ? Could you please explain ? >>>
>
> Yes. HBase is a column-oriented data store, just like Bigtable: the
> data for each column family is stored together. Adding new family
> members is cheap; adding a new column family is expensive.
>
> Regions are indeed a bunch of rows. A single region represents
> a row range from [low-key:high-key). For each region there is
> an HStore for each column family that has data in the region's
> row range.
>
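The region and per-family store layout described above can be modelled with a toy sketch. This is plain Python, not the HBase API: the `Region` class, its methods and the sample row are hypothetical names made up for illustration, and a dict stands in for each HStore.

```python
class Region:
    """Toy model of one region: rows in [low_key, high_key), one store per family."""

    def __init__(self, low_key, high_key):
        self.low_key = low_key
        self.high_key = high_key
        # family name -> {(row, member): value}; each dict stands in for an HStore
        self.stores = {}

    def contains(self, row):
        # Half-open range: low_key is included, high_key is excluded.
        return self.low_key <= row < self.high_key

    def put(self, row, column, value):
        if not self.contains(row):
            raise KeyError(f"row {row!r} is outside this region")
        family, _, member = column.partition(":")
        # A store appears only once its family has data in this row range.
        self.stores.setdefault(family, {})[(row, member)] = value

    def get_family(self, row, family):
        # Reading one family touches only that family's store.
        store = self.stores.get(family, {})
        return {member: value for (r, member), value in store.items() if r == row}


region = Region("a", "m")
region.put("bob", "info:first_name", b"Bob")
region.put("bob", "course:math101", b"B")
print(sorted(region.stores))             # families that have data in this region
print(region.get_family("bob", "info"))  # members of one family for one row
```

In this model a new family member is just one more key in an existing store, which is consistent with the point that family members are cheap while a new family needs a new store in every region.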
> > >
> > > > If the second is the way to go, would it harm HBase's flexibility,
> > > > which allows adding a new type of attribute that may pop up after I
> > > > have defined the table schema? E.g. a new data source comes in with
> > > > the 'age' attribute, which was not known when the schema was defined.
> > >
> > > This is the disadvantage of the one column per attribute approach.
> > > It is expensive to add a new column family, but new family members
> > > can be added at any time.
> >
> >
> > <<< Can a column be added to an existing table then, or only
> > prior to create ? In what sense is it expensive to add a new
> > column ? >>>
>
> You can add a new column family to an existing table, but you must first
> 'disable' the table (take it offline). It is expensive because adding
> a new column family means creating a new HStore for each existing region.
>
> > >
> > >
> > > > 2. For attributes which may have multiple values, would it make
> > > > sense to define a common column family and add a column for each
> > > > value ?
> > >
> > > It might make sense in this case to have a family for the multi-valued
> > > attribute and just add a new member for each new value.
> > >
> > > > 2.1 For hobbies - I'd define a 'hobby' column family under which I
> > > > put each hobby in a separate column. Should I use hobby_i (i being
> > > > incremented by 1 for each new hobby inserted in the row) as the
> > > > column name and the actual hobby as the value ? Or should I rather
> > > > use the hobby name as the column name and some arbitrary value
> > > > (e.g. 1) as the cell value ?
> > >
> > > I'd define a family, hobby, and use a new family member for each
> > > value, for example:
> > >
> > > hobby:video-games
> > > hobby:tennis
> > > hobby:floral-arranging
> > > etc.
> > >
> > > > 2.2 Similarly, for grades there could be a common grades family. For
> > > > each course grade, I could put the course id as the column name and
> > > > the course grade as the value. Does that make sense ?
> > >
> > > Yes. For example:
> > >
> > > Family course:
> > >
> > > course:math101 (value) B
> > > course:economics203 (value) C
> > > etc.
> > >
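To make the family:member naming concrete, here is a minimal sketch that treats one student's row as a plain Python dict keyed by "family:member". The dict layout and the `family_members` helper are assumptions for illustration only; they are not how an HBase client represents a row.

```python
# One student's row: the member name itself carries the data for hobbies,
# while course members carry a grade as the cell value.
row = {
    "hobby:video-games": b"",
    "hobby:tennis": b"",
    "hobby:floral-arranging": b"",
    "course:math101": b"B",
    "course:economics203": b"C",
}


def family_members(row, family):
    """Return {member: value} for every column in the given family."""
    prefix = family + ":"
    return {
        name[len(prefix):]: value
        for name, value in row.items()
        if name.startswith(prefix)
    }


print(sorted(family_members(row, "hobby")))      # list a student's hobbies
print(family_members(row, "course")["math101"])  # look up one grade
```

Adding a hobby or a course is just one more key in the dict, with no change to the set of families.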
> > > > 3. Say there is the 'zipcode' attribute, and a student may have
> > > > multiple zip codes associated with her. So far, it is a case similar
> > > > to question 2. But what if for each zip I have the matching city and
> > > > state information ? Should I create a separate table with each row
> > > > containing a zip and the corresponding city and state, and use a
> > > > join at query time if needed ?
> > >
> > > There is no join operation in HBase. However, you could run a
> > > map/reduce job to do something like a join.
> >
> >
> > <<< Is there a code sample somewhere for doing a join-like
> > map/reduce job on top of HBase ? >>>
>
> The best examples we have available for using HBase with map/reduce are
> in the test cases (see org.apache.hadoop.hbase.mapred.*)
>
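For the general shape of a join done in map/reduce style, a rough sketch follows. It is plain Python that imitates the map, shuffle and reduce phases in memory; it does not use Hadoop or the classes in org.apache.hadoop.hbase.mapred, and the table contents, function names and place names are all made-up placeholders.

```python
from collections import defaultdict
from itertools import chain


def map_students(students):
    # Emit (zip, tagged record) for every zip a student has.
    for student, zip_codes in students.items():
        for zip_code in zip_codes:
            yield zip_code, ("student", student)


def map_zips(zips):
    # Emit (zip, tagged record) for every row of the zip table.
    for zip_code, location in zips.items():
        yield zip_code, ("location", location)


def shuffle(pairs):
    # Group all emitted values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(zip_code, values):
    # Pair every student seen for this zip with every location seen for it.
    locations = [v for tag, v in values if tag == "location"]
    names = [v for tag, v in values if tag == "student"]
    for name in names:
        for city, state in locations:
            yield name, zip_code, city, state


def join(students, zips):
    groups = shuffle(chain(map_students(students), map_zips(zips)))
    rows = []
    for zip_code in sorted(groups):
        rows.extend(reduce_join(zip_code, groups[zip_code]))
    return rows


students = {"student-1": ["12345", "09876"]}
zips = {"12345": ("CityA", "XX"), "09876": ("CityB", "XX")}
for joined_row in join(students, zips):
    print(joined_row)
```

The tag on each value is what lets the reducer tell which table a record came from; a zip with no matching student or no matching location simply produces no output row.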
> >
> > >
> > >
> > > For zipcode, I might do something like:
> > >
> > > Family zip:
> > >
> > > zip:12345 (value) home
> > > zip:09876 (value) school
> > > etc.
> > >
> > > > Or is there a way to de-normalize the data and somehow integrate the
> > > > multiple zips plus the city and state of each within the original
> > > > students table ?
> > >
> > > It is a little tricky to store multi-value attributes in a column that
> > > is multivalued.
> > >
> > > For example if the row key is the student name, you could have
> > > something like:
> > >
> > > Family info:
> > > info:id
> > > info:address
> > > info:zip1
> > > info:zip2
> > >
> > > or:
> > >
> > > info:id
> > > info:address
> > > info:zip (value is a serialized map of zipcode, location)
> > >
> > > > To what extent should I aspire to denormalize data ?
> > >
> > > Again, it depends on your access patterns. If the data is going to be
> > > accessed together, it is probably better to put it in the same
> > > family. If you know that some data will never (or rarely) be accessed
> > > together, then put it in separate column families.
> > >
> > > > 4. Can columns of different types (numbers/text/date) share the same
> > > > column family ?
> > >
> > > There are no data types in HBase. All values are byte[].
> > >
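Since every value is just bytes, the application has to choose an encoding per attribute and apply it consistently on write and read. A small sketch of that idea, with encodings picked purely for illustration (UTF-8 text, 4-byte big-endian integers, ISO-8601 dates); none of these choices come from HBase itself.

```python
import struct
from datetime import date


def encode_text(text):
    return text.encode("utf-8")


def decode_text(raw):
    return raw.decode("utf-8")


def encode_int(number):
    # 4-byte big-endian signed integer.
    return struct.pack(">i", number)


def decode_int(raw):
    return struct.unpack(">i", raw)[0]


def encode_date(day):
    # ISO-8601 text, e.g. 2008-05-18.
    return day.isoformat().encode("ascii")


def decode_date(raw):
    return date.fromisoformat(raw.decode("ascii"))


# Values of three different "types" that could sit side by side in one family.
cells = {
    "info:last_name": encode_text("Kraus"),
    "info:age": encode_int(21),
    "info:start_date": encode_date(date(2008, 5, 18)),
}
print(decode_int(cells["info:age"]), decode_date(cells["info:start_date"]))
```

Nothing in the stored bytes records which encoding was used, so the reader must already know how each column was written.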
> > > > Thanks for any help, Naama
> > > >
> > > > --
> > > > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> > > > 00 oo 00 oo 00 oo 00 oo
> > > > "If you want your children to be intelligent, read them fairy
> > > > tales. If you want them to be more intelligent, read them more
> > > > fairy tales." (Albert Einstein)
> > > >
> > >
> >
> >
> >
>



--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)