Post syndicated from mikesefanov (original)



We’re excited to co-host the 10th annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 13 – 15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:

Familiarize yourself with the cutting edge in Apache project developments from the committers

Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories and best practices to leverage all your data – on-premise and in the cloud – to drive predictive analytics, distributed deep-learning and artificial intelligence initiatives

Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more

Similar to previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the breakout sessions below. Or, stop by Yahoo kiosk #700 at the community showcase.


Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code MSPO20. Register for Hadoop Summit, San Jose, California!

DAY 1. TUESDAY June 13, 2017

12:20 – 1:00 P.M. TensorFlowOnSpark – Scalable TensorFlow Learning on Spark Clusters

Andy Feng – VP Architecture, Big Data and Machine Learning

Lee Yang – Sr. Principal Engineer

In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which was open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.

2:10 – 2:50 P.M. Managing Kernel Upgrades at Scale – The Dirty Cow Story

Samy Gawande – Sr. Operations Engineer

Savitha Ravikrishnan – Site Reliability Engineer

Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of 100s of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips and tricks to deal with large scale kernel upgrades on heterogeneous platforms within tight timeframes with 100% uptime and no service or data loss through the Dirty COW use case (privilege escalation vulnerability found in the Linux Kernel in late 2016).

5:00 – 5:40 P.M. Data Highway Rainbow – Petabyte Scale Event Collection, Transport, and Delivery at Yahoo

Nilam Sharma – Sr. Software Engineer

Huibing Yin – Sr. Software Engineer

This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport and aggregated delivery as a service. Data Highway supports collection from multiple data centers & aggregated delivery in primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm and Kafka, with Storm & Kafka endpoints tailored towards latency sensitive consumers.

DAY 2. WEDNESDAY June 14, 2017

9:05 – 9:15 A.M. Yahoo General Session – Shaping Data Platform for Lasting Value

Sumeet Singh – Sr. Director, Products

With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.

12:20 – 1:00 P.M. CaffeOnSpark Update – Recent Enhancements and Use Cases

Mridul Jain – Sr. Principal Engineer

Jun Shi – Principal Engineer

By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences about the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer which supports multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment.

12:20 – 1:00 P.M. Tez Shuffle Handler – Shuffling at Scale with Apache Hadoop

Jon Eagles – Principal Engineer

Kuhu Shukla – Software Engineer

In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, which has support for multi-partition fetch to mitigate performance slowdown, and provides deletion APIs to reduce disk usage for long running Tez sessions. As this is an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real world jobs at scale.

2:10 – 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

Thiruvel Thirumoolan – Principal Engineer

Francis Liu – Sr. Principal Engineer

At Yahoo!, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (e.g. batch, near real-time, ad-hoc). We will walk through multi-tenancy features, explaining our motivation, how they work, and our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.

2:10 – 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse

Nick Huang – Director, Data Engineering, Yahoo Mail

Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail

Since 2014, the Yahoo Mail Data Engineering team has taken on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail. In this session we will share our experience from this 3 year journey, from the system architecture and the analytics systems built to the learnings from development and the drive for adoption.

DAY 3. THURSDAY June 15, 2017

2:10 – 2:50 P.M. OracleStore – A Highly Performant RawStore Implementation for Hive Metastore

Chris Drome – Sr. Principal Engineer

Jin Sun – Principal Engineer

Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. Then, we implemented OracleStore with specific goals built in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains that were realized with this alternative implementation. These include a reduction of 97%+ in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.

3:00 P.M. – 3:40 P.M. Bullet – A Real Time Data Query Engine

Akshai Sarma – Sr. Software Engineer

Michael Natkovich – Director, Engineering

Bullet is an open sourced, lightweight, pluggable querying system for streaming data, without a persistence layer, implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted, or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.

3:00 P.M. – 3:40 P.M. Yahoo – Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez

Rohini Palaniswamy – Sr. Principal Engineer

Last year at Yahoo, we spent great effort in scaling, stabilizing and making Pig on Tez production ready, and by the end of the year retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen, like a custom YARN ShuffleHandler, reworking DAG scheduling order, serialization changes, etc. We will also cover exciting new features that were added to Pig for performance, such as bloom join and byte code generation.

4:10 P.M. – 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning

Evans Ye – Software Engineer

Apache Bigtop, as an open source Hadoop distribution, focuses on developing packaging, testing and deployment solutions that help infrastructure engineers build up their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality. A number of Hadoop components also need to be verified to work well together. In this presentation, we’ll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and Docker Sandbox, which lets developers quickly start a big data stack. The content of this talk includes the containerized CI framework, technical details of Docker Provisioner and Docker Sandbox, a hierarchy of Docker images we designed, and several components we developed, such as Bigtop Toolchain, to achieve build automation.

Register for Hadoop Summit, San Jose, California with the 20% discount code MSPO20

Questions? Feel free to reach out to us. Hope to see you there!
