Aleksei Udatšnõi – Crunching thousands of events per second in nearly real time
Imagine you have a product that generates up to 10 thousand events per second, or around 1 billion events per day. This live stream of data needs to be tracked, processed and presented to end users in a visually appealing way, and the solution needs to be integrated into a traditional web application. That is a real use case at Softonic, and in this talk we will show how we solved it. We use a stack of Big Data technologies to process and store the live stream and to present the results to users in nearly real time. This real-life solution is built around the Hadoop ecosystem and includes Flume, Hive, Oozie and Impala. We will show how to store and query such volumes of data using a NoSQL database, and how to build a scalable end-user web application on top of a nearly real-time data feed.
Ameya Kanitkar – Scaling Real Time Analytics with Storm & HBase
Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure that handles over 1 million events per second and stores and scales to billions of data points. The solution leverages Apache HBase and Redis for storage and Apache Storm for real-time analytics. We will cover the architectural design choices and trade-offs involved, including some interesting algorithmic choices such as Bloom filters and HyperLogLog. Attendees can take away lessons from our real-life experience that will help them understand various tuning methods and their trade-offs, and apply them in their own solutions.

Note: This is a default abstract; the talk can be customized to focus on one area or another, or changed to a 40-minute or 25-minute talk. More importantly, we feel it is a fascinating story of how these Big Data technologies are actually used in real-life situations, and of the powerful applications you can build when you combine a NoSQL store such as HBase with the real-time processing power of Storm. Sometimes small is big: clever algorithms such as Bloom filters and HyperLogLog come in handy when building scalable systems. I am very passionate about sharing this story, so do let me know if you have any questions – I hope we can work out a way to make this very useful for the conference attendees.
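To illustrate why a Bloom filter is such a handy trade-off in analytics pipelines like the one above, here is a toy Python sketch (the bit-array size, hash count and key names are our own arbitrary choices, not Groupon's): membership tests become cheap and compact at the cost of occasional false positives, and there are never false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived hash functions.
    might_contain() can return false positives but never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting a single cryptographic hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
```

The whole set fits in 128 bytes here regardless of how many distinct items it has seen, which is exactly the property that makes such sketches attractive at millions of events per second.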
Clinton Gormley – Elasticsearch Query DSL – Not just for wizards…
The Elasticsearch Query DSL is a rich, flexible, powerful query language for full-text and structured search, but with power comes complexity. Which of the 40 available queries should you use? What's a filter, and when should you use one? How do you combine multiple filters, multiple queries, or queries with filters? To most users, "relevance", and how it is affected by different queries, is a black box. Multi-field queries in particular can be difficult to get right if you don't understand how they work. In this talk, I will explain the Query DSL from the ground up: how filters and queries use the inverted index to find matching documents, how the relevance score is calculated, and how to combine the filter/query building blocks into complex statements. Finally, I will talk about the pitfalls of multi-field queries and how to avoid them.
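As a small taste of combining queries with filters, here is a hypothetical request body (the index fields are invented for illustration) using the bool query of Elasticsearch 2.x and later: must clauses contribute to the relevance score, while filter clauses are non-scored yes/no criteria that Elasticsearch can cache.

```python
import json

# The "must" clause is scored full-text search; the "filter" clauses
# only include/exclude documents, so they are cheaper and cacheable.
search_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "distributed search"}}
            ],
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"date": {"gte": "2014-01-01"}}}
            ]
        }
    }
}

print(json.dumps(search_body, indent=2))
```

This body would be POSTed to a `_search` endpoint; the talk goes into when each clause type is appropriate and how the scored part interacts with the inverted index.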
DOAN DuyHai – Cassandra: real world best use-cases and worst anti-patterns
In this session, you'll see how to leverage the best features of Cassandra to solve real-world problems (a rate-limiting/anti-fraud system, account validation, security tokens …). We'll also highlight some common anti-patterns (queues, partition key misses, CQL3 nulls) and see how to solve them the Cassandra way.
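To hint at the rate-limiting use case, here is a sketch of the commonly described TTL-based pattern in Cassandra: each request inserts a row that the server expires automatically, and the limiter simply counts the rows that survive in the window. The table and column names below are our own illustration, not necessarily what the talk presents.

```python
# CQL statements for a TTL-based sliding-window rate limiter (illustrative).
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

create_table = """
CREATE TABLE IF NOT EXISTS rate_limit (
    user_id text,
    request_time timeuuid,
    PRIMARY KEY (user_id, request_time)
);
"""

# Every request writes one clustering row that Cassandra discards
# server-side after WINDOW_SECONDS -- no cleanup job needed.
record_request = f"""
INSERT INTO rate_limit (user_id, request_time)
VALUES (?, now())
USING TTL {WINDOW_SECONDS};
"""

# Counting one partition gives the number of requests still in the window.
count_requests = "SELECT count(*) FROM rate_limit WHERE user_id = ?;"

def is_allowed(current_count: int) -> bool:
    return current_count < MAX_REQUESTS
```

The appeal of the pattern is that expiry is handled by the database itself; the trade-offs (tombstones, counting cost) are the kind of detail the session digs into.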
Ellen Friedman – Thinking With Your Eyes Closed
NoSQL is one of the most creative movements to emerge in database technology in recent history. How can you continue to foster this creative strength while attending to the demands of NoSQL growing into mainstream, enterprise-ready technology? This talk will trace the connections between careful, disciplined work and leaps of insight and advancement in fields ranging from molecular biology to oceanography to the manufacture of modern “smart” textiles. We will examine current trends in the NoSQL movement, including the increasingly widespread use of NoSQL techniques to implement large-scale time-series databases and the paradoxical emergence of SQL in NoSQL environments. Finally, rather than trying to predict the future, we’ll look at what future you choose to build.
Eric Redmond – Distributed Search on Riak 2.0
Riak excels at one type of query: key puts and key gets. But the world demands more from a database. Since Basho isn't primarily a search company, we decided to leverage the power of Solr for Riak 2.0. This is a walkthrough of the new features we added, how they work, and why you'd want to use them. Also, of course, live demos.
Frank Celler – Processing large-scale graphs with Google(TM) Pregel
Many popular graph databases are optimized to run on a single machine, using efficient traversals to query the stored graphs. This boosts the performance of algorithms that originate at a single vertex and iterate through the graph, e.g. finding shortest paths or neighbors. However, graphs are getting bigger, and traversals perform poorly when they require a large depth. If you need to distribute a large-scale graph across several machines, traversals won't be the best choice (in terms of performance) for processing the graph. Therefore Google has released its Pregel framework, offering an environment to query distributed graphs; Pregel is also known as the map-reduce for graphs. In this talk I want to present the architecture and requirements of the Pregel framework and introduce you to the different mind-set required to write a Pregel algorithm. Furthermore, I will give a short introduction to three implementations of Pregel — Giraph, TinkerPop3 and ArangoDB.
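To give a flavour of the Pregel mind-set, here is a toy single-machine simulation of vertex-centric single-source shortest paths: in every superstep each active vertex processes its incoming messages, updates its state, and messages its neighbours; a vertex with no improvement "votes to halt". Real Pregel implementations distribute vertices and messages across workers, which this sketch deliberately ignores.

```python
def pregel_sssp(edges, source):
    """Vertex-centric single-source shortest paths, Pregel style.
    `edges` maps vertex -> [(neighbour, weight), ...]."""
    # Collect all vertices, including those that only appear as targets.
    vertices = set(edges)
    for nbrs in edges.values():
        vertices.update(n for n, _ in nbrs)
    dist = {v: float("inf") for v in vertices}

    messages = {source: [0]}          # superstep 0: only the source is active
    while messages:                   # run until every vertex has halted
        next_messages = {}
        for vertex, incoming in messages.items():
            best = min(incoming)
            if best < dist[vertex]:   # improved: update and wake neighbours
                dist[vertex] = best
                for neighbour, weight in edges.get(vertex, []):
                    next_messages.setdefault(neighbour, []).append(best + weight)
        messages = next_messages      # unimproved vertices vote to halt
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)]}
print(pregel_sssp(graph, "a"))   # {'a': 0, 'b': 1, 'c': 2}
```

Note how the algorithm is written from the perspective of one vertex reacting to messages, not as a global traversal; that inversion is the "different mind-set" the talk refers to.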
Giovanni Lanzani – SQL & NoSQL databases for data driven applications
For data to be the fuel of the 21st century, and for data science to live up to its promise as a driver of innovation, their application should not be confined to dashboards and static analyses. Instead they should drive real applications that support the organisations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for secure, real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources. In this talk we will describe our journey to build a data-driven application at a large Dutch financial institution. We will dive into the issues we faced, our considerations, and the technical choices we made in order to perform data analyses but also drive a web-based, real-time application. We considered and used Impala, HBase and MongoDB, but also conventional SQL databases such as MySQL and PostgreSQL. Important aspects of our journey were, among others, the handling of geographical data, access to hundreds of millions of records, and the real-time analysis of millions of data points.
Glynn Bird – Cloudant – Building applications for success.
All too often, web applications are built to work in development but are not capable of scaling when success arrives. Whether the application is a log aggregator that can't deal with the throughput, a blog that can't handle traffic when it hits the heights of Google's rankings or a mobile game that goes viral, an application can become the victim of its own success. By building with Cloudant from the outset, and architecting the application to scale by design, we can build apps that scale as the traffic, data-volumes and users arrive. Using several real-life use cases, this talk will detail how Cloudant can solve an application's data storage, search and retrieval needs, scaling easily with success!
Johnny Miller – Cassandra + Spark = Awesome
This talk will discuss how Cassandra and Spark can work together to deliver real-time analytics. This is a technical discussion that will introduce attendees to the basic principles of Cassandra and Spark, why they work well together, and example use cases.
Jordi Nin – Hermes: Distributed social network monitoring system
Nowadays, social network services play a very important role in the way people interact with each other and with the world. This generates big amounts of data that can be used to study social relationships and extract useful information about preferences and trends. When analysing this information, two main problems emerge: the need to aggregate data coming from multiple sources, and hardware limitations due to the inability of traditional systems to deal with large amounts of data. To solve these problems, Hermes aims to implement a distributed, scalable social media analysis tool, ready to connect to and gather data from multiple sources and to show the aggregated results in real time using CouchDB, Elasticsearch and Kibana.
Josep Lluis Larriba-Pey – Graph databases go mobile, Sparksee 5 mobile use cases
The use of graph databases is becoming more and more popular. Sparksee is a clear example of a high-performance graph database that offers high compression and a small software footprint, allowing for very compact and efficient solutions in the business world. Sparsity Technologies has recently released Sparksee 5 mobile for iOS, Android and BB10, allowing high-performance mobile applications to be boosted with high-performance analytics. Sparksee is research-based software with a considerable number of papers published on it, showing the importance of research in high-end technologies. In this talk, we will present Sparksee 5 mobile and explain a few use cases in the area of analytics for Social and Open Data, where the use of graphs boosts job search, private recommendation, community search and personal tourist route planning.
Kai Wähner – Real World Use Cases for Realtime In-Memory Computing
NoSQL is not just about different storage alternatives such as document stores, key-value stores, graphs or column-based databases. The hardware is also getting much more important. Besides common disks and SSDs, enterprises are beginning to use in-memory storage more and more, because a distributed in-memory data grid provides very fast data access and updates. While its performance will vary depending on multiple factors, it is not uncommon for it to be 100 times faster than corresponding database implementations. For this reason and others described in this session, in-memory computing is a great solution for lifting the burden of big data, reducing reliance on costly transactional systems, and building highly scalable, fault-tolerant applications. The session begins with a short introduction to in-memory computing. Afterwards, different frameworks and product alternatives for implementing in-memory solutions are discussed. Finally, the main part of the session shows several real-world use cases where in-memory computing delivers business value by supercharging the infrastructure.
Kristoffer Dyrkorn – Beating the traffic jam using NoSQL
Most people have experienced the boredom of being stuck in traffic. Up-to-date and credible information about congestion and detours could save us time and frustration in our everyday lives. The Norwegian Public Roads Administration is now building a new infrastructure for road traffic measurements, and the system will provide high-quality, near-realtime information as publicly available open data. The project relies heavily on NoSQL technology (Elasticsearch) for high-performance data gathering and statistical analysis. This talk will give a walkthrough of the project and the solution, and show how NoSQL has helped in building an application that meets demanding requirements. Several use cases illustrating the value of the system, both for the general public and for private companies and public institutions, will be given.
Max Neunhöffer – Joins and aggregations in a distributed NoSQL DB
NoSQL databases are renowned for their good horizontal scalability, and sharding is these days an essential feature for every self-respecting DB. However, most systems choose to offer fewer features for joins and aggregations in queries than traditional relational DBMSs do. In this talk I report on the joys and pains of (re-)implementing the powerful query language AQL, with joins and aggregations, for ArangoDB. I will cover (distributed) execution plans, query optimisation and data locality issues.
Rafael Gimenez – Scaling up in a world of geolocated data
While the implementation of analytic operations on distributed computing frameworks has been widely described, enabling the computational core of a Big Data system to support geospatial querying of data is still a challenging issue. This session targets that specific aspect by reviewing how researchers at the BDigital Technology Centre have designed and implemented a stack for advanced Machine Learning on Urban Data, providing a way to geoquery massive amounts of HDFS data from Spark processes without hindering overall system performance. The geospatial dimension of data is being revealed as the most natural, powerful and intuitive way to explore the expanding world of data and services. The ability to rely on real-world axes such as places, people, events and things can provide better answers for individuals' everyday tasks, as well as a deep understanding for businesses and administrations. The research efforts of the Urban Data Analytics team at BDigital are focused on that scenario, with an offering built upon the ability to rapidly deploy pre-defined (but also arbitrary) analytic functions on geospatial time series of data. Currently available developments already provide support for characterization, classification, clustering, anomaly detection and trajectory mining, while multivariate analytics and predictive functions (on both single and combined time series) are targeted for the near future. To enable such analytic operations on geospatially enabled data, the underlying computational infrastructure must provide the distributed computational processes with a tool that supports high-speed and highly dynamic geoquerying operations on massive amounts of data. The combination of end-to-end geoquerying components and GIS enhancements for HDFS data has been implemented and tested by BDigital as the most promising solution for these requirements.
Salvatore Sanfilippo – How Redis Cluster works, and why
In this talk the algorithmic details of Redis Cluster will be exposed in order to show the design tensions in the clustered version of a high-performance database supporting complex data types, the selected trade-offs, and their effect on the availability and consistency of the resulting solution. Other, non-chosen solutions in the design space will be illustrated for completeness.
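For background on the data distribution the talk discusses: Redis Cluster assigns every key to one of 16384 hash slots via a CRC16 checksum, and a `{...}` hash tag lets clients pin related keys to the same slot (and therefore the same node). A small Python sketch of that mapping:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM (polynomial 0x1021, initial value 0), the checksum
    Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of Redis Cluster's 16384 hash slots.
    If the key contains a non-empty {...} hash tag, only the tag
    is hashed, so related keys land in the same slot."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:   # tag must be non-empty
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))  # True
```

Slots, not individual keys, are what gets moved between nodes during resharding, which is central to the availability/consistency trade-offs the talk covers.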
Sebastian Cohnen – Building a Startup with NoSQL
At StormForger we use several NoSQL systems to handle all kinds of data. We have a lot of time series data, since we do load testing and performance analysis of HTTP-based infrastructure and services; for time series data we use InfluxDB. We also use several Redis instances for caching and for storing structured data that needs fast read and write access. Lately we have also started to integrate ArangoDB into our architecture, which is a perfect fit for storing and working with our complex test case definition data structures. In this talk I’d like to present how we built our startup on the foundation provided by several NoSQL databases, how we came to choose those systems, and how we use them.
Simon Elliston Ball – When to NoSQL and When to Know SQL
With NoSQL, NewSQL and plain old SQL, there are so many tools around that it’s not always clear which is the right one for the job. This is a look at a series of NoSQL technologies, comparing them against traditional SQL technology. I’ll compare real use cases, show how they are solved both with NoSQL options and with traditional SQL servers, and then see who wins. We’ll look at some code and architecture examples that fit a variety of NoSQL techniques, and some where SQL is a better answer. We’ll see some big data problems, little data problems, and a bunch of new and old database technologies, to find whatever it takes to solve the problem. By the end you’ll hopefully know more NoSQL, maybe even have a few new tricks with SQL, and, what’s more, know how to choose the right tool for the job.
Ted Dunning – Very High Bandwidth Time Series Database Implementation
This talk will describe our work in creating time series databases with very high ingest rates (over 100 million points per second) on very small clusters. Starting with OpenTSDB and the off-the-shelf version of MapR-DB, we were able to accelerate ingest by >1000x. I will describe our techniques in detail and talk about the architectural changes required. We are also working to allow access to OpenTSDB data using SQL via Apache Drill. In addition, I will talk about the implications of this work for the much-fabled Internet of Things – and tell some stories about the origins of open-source big data in the 19th century at sea.
Trisha Gee – Building a web application with MongoDB & Java
NoSQL solutions are, apparently, “agile” and easy to prototype with. I’m going to demonstrate what this really means by creating a simple web application in half an hour, in front of your very eyes. I’m going to show you how using the correct tools for the job, including NoSQL, can make rapid prototyping simple. Some of the technologies I’ll be using are: AngularJS; Bootstrap; HTML5; Java; MongoDB; and Groovy – a fully buzzword-compliant application. While I won’t go into every technology in depth, I’ll demonstrate the role of each tool and how they interact. At the end of the talk we will have a fully working, mobile-and-browser-friendly application, without compromising on design or good practice. Yes, live coding, with all attendant danger.
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs
There are several challenges in the NoSQL world. Especially if you have very high availability requirements, you have to accept temporary inconsistencies, which you need to resolve explicitly. This is usually a tough job that requires implementing case-by-case business logic, or even bothering your users to decide on the correct state of your data. Wouldn't it be great if we could solve this conflict resolution and data reconciliation process in a generic way, at a purely technical level? That's exactly what CRDTs (Conflict-free Replicated Data Types) are about. CRDTs are data structures that are guaranteed to converge to a desired state while enabling extreme availability of the datastore. In this session you will learn what CRDTs are, how to design them, what you can do with them, and what their limitations and trade-offs are – of course garnished with lots of tips and tricks. Get ready to push the availability of your datastore to the max!
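A minimal example of a CRDT is the grow-only counter (G-Counter). This Python sketch shows why replicas converge without any explicit conflict resolution: each replica only increments its own entry, and merging takes the per-replica maximum, an operation that is commutative, associative and idempotent.

```python
class GCounter:
    """Grow-only counter CRDT. Each replica increments only its own
    entry; merge() takes the per-replica maximum, so state exchange
    converges regardless of message order or duplication."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        # The counter's value is the sum over all replicas' entries.
        return sum(self.counts.values())

    def merge(self, other):
        # Per-replica max: applying the same merge twice changes nothing.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()   # replica a saw 2 increments
b.increment(5)                 # replica b saw 5
a.merge(b)
print(a.value())               # 7
```

Counters that also support decrements (PN-Counters), sets, and registers are built from the same idea; the limitations and trade-offs of such designs are exactly what the session explores.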