I think the main difference I can describe relates to record-oriented vs. column-oriented formats.

Record-oriented formats are what we're all used to - text files, and delimited formats like CSV and TSV. AVRO is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record (there's a small sketch of this at the end of the post). Other tricks of various formats (especially including compression) involve whether a format can be split - that is, can you read a block of records from anywhere in the dataset and still know its schema? But here's more detail on columnar formats like Parquet.

Parquet and other columnar formats handle a common Hadoop situation very efficiently. It is common to have tables (datasets) with many more columns than you would expect in a well-designed relational database - a hundred or two hundred columns is not unusual. This is so because we often use Hadoop as a place to denormalize data from relational formats - yes, you get lots of repeated values and many tables all flattened into a single one. But it becomes much easier to query since all the joins are worked out. There are other advantages, such as retaining state-in-time data. So anyway, it's common to have a boatload of columns in a table.

Let's say there are 132 columns, and some of them are really long text fields, each column following the other and using up maybe 10K per record.

While querying these tables is easy from a SQL standpoint, it's common that you'll want to get some range of records based on only a few of those hundred-plus columns. With a columnar format you can read just those columns and skip the rest, as the second sketch below shows.
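To make the schema-evolution point concrete, here's a minimal sketch using the fastavro Python library (my choice - the post doesn't name a library, and the `User` record and its fields are hypothetical). Records written with an old schema are read back through a newer schema that adds an `email` column with a default:

```python
from io import BytesIO
import fastavro

# Writer's schema: version 1 of a hypothetical "User" record.
schema_v1 = fastavro.parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Reader's schema: version 2 adds an "email" column with a default,
# so records written under v1 can still be read.
schema_v2 = fastavro.parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},
    ],
})

buf = BytesIO()
fastavro.writer(buf, schema_v1, [{"id": 1, "name": "Ada"}])
buf.seek(0)

# Avro schema resolution fills the missing column with its default.
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'id': 1, 'name': 'Ada', 'email': ''}
```

The default value is what makes this safe: any field added to the reader's schema without a default would fail resolution against old data.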
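And here's a minimal sketch of the columnar payoff, using pyarrow (again my choice, not something the post specifies; the file name and the three columns are made up). The point is that a columnar read only touches the columns you ask for, so the big text field never has to come off disk:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny stand-in for a wide, denormalized table.
# A real one might have 132 columns; three keep the sketch short.
table = pa.table({
    "month": ["2024-02", "2024-03", "2024-04"],
    "sales": [700.0, 450.0, 900.0],
    "notes": ["long text...", "long text...", "long text..."],
})
pq.write_table(table, "wide_table.parquet")

# Columnar payoff: read back only the two columns the query needs.
# The "notes" column (the long text field) is skipped entirely.
subset = pq.read_table("wide_table.parquet", columns=["month", "sales"])
print(subset)
```

In a row-oriented format, answering the same query would mean scanning and parsing every full record, long text fields and all.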