Aegis School of Business, Data Science, Cyber Security & Telecommunication
Application fee: 0 USD
This course covers how to do machine learning at large scale, involving terabytes (maybe even petabytes) of data spread across several server nodes. The course's answer is Apache Spark and its machine learning library, MLlib.
What is Spark?
Apache Spark™ is a fast and general engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark is an in-memory analytics engine that can run on top of HDFS and also unifies many other data sources, such as NoSQL databases like MongoDB or even plain CSV files. Spark is also a much faster and simpler replacement for Hadoop's original processing model, MapReduce. IBM has announced plans to include Spark in all its analytics platforms and has committed 3,500+ developers to Spark-related projects.
Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Ease of use: Write applications quickly in Java, Scala, Python, R. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python and R shells.
Generality: Combine SQL, streaming, and complex analytics.
Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
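To make the idea of chaining high-level operators with deferred, in-memory execution a little more concrete, here is a minimal plain-Python sketch. It is not Spark code and does not use Spark's API: `LazyDataset` is a hypothetical class whose method names were chosen to resemble operators such as `map`, `filter`, `collect` and `reduce`, and everything runs in a single process. It only models a linear chain of steps (the simplest kind of DAG), and assumes the source can be iterated more than once.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, not Spark's API).

    Transformations only record a step; work happens when an action is called.
    """

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection of records
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new plan and does no work yet
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new plan and does no work yet
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Stream records through the whole chain without building
        # intermediate lists between steps
        stream = iter(self._source)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    def collect(self):
        # Action: execute the recorded plan and return all results
        return list(self._run())

    def reduce(self, fn):
        # Action: execute the recorded plan and fold the results together
        return reduce(fn, self._run())


calls = []  # records which inputs have actually been processed


def square(x):
    calls.append(x)
    return x * x


plan = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
ran_before_action = len(calls)                 # 0: nothing has executed yet
odd_squares = plan.collect()                   # [1, 9, 25]
total = plan.reduce(lambda a, b: a + b)        # 35
```

Building `plan` costs nothing; the squares are only computed when `collect` or `reduce` is called. A real engine uses the same separation to plan a whole job before running it across many machines.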
What is Spark MLlib?
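MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as ML algorithms (classification, regression, clustering, and collaborative filtering), featurization (feature extraction, transformation, dimensionality reduction, and selection), pipelines (constructing, evaluating, and tuning ML workflows), persistence (saving and loading algorithms, models, and pipelines), and utilities (linear algebra, statistics, and data handling).
What will be covered in this course?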
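MLlib's pipeline API organizes a workflow as a sequence of stages, where some stages (estimators) are fitted to data and others (transformers) simply transform it. The plain-Python sketch below imitates that fit/transform pattern on a tiny in-memory list of rows. It is an illustration under stated assumptions, not MLlib itself: `Tokenizer`, `VocabularyCounter`, `VocabularyModel`, `Pipeline` and `PipelineModel` here are hypothetical toy classes written for this example, and rows are ordinary dictionaries rather than Spark DataFrames.

```python
class Tokenizer:
    """Transformer: stateless, so it needs no fitting."""

    def transform(self, rows):
        # Add a "words" field holding the lower-cased tokens of "text"
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class VocabularyCounter:
    """Estimator: learns a vocabulary and returns a fitted transformer."""

    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["words"]})
        return VocabularyModel(vocab)


class VocabularyModel:
    """Fitted transformer: turns words into a count vector over the vocabulary."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            dict(r, features=[r["words"].count(w) for w in self.vocab])
            for r in rows
        ]


class Pipeline:
    """Runs stages in order, fitting any stage that has a fit() method."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)    # estimator becomes a fitted model
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """The fitted pipeline: applies every fitted stage to new data."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "Spark is fast"}, {"text": "Hadoop is slow"}]
model = Pipeline([Tokenizer(), VocabularyCounter()]).fit(train)
scored = model.transform([{"text": "spark spark is new"}])
```

Fitting happens once on the training rows, and the resulting `model` can then be applied to unseen rows. Words that were not in the training vocabulary (such as "new" above) are simply ignored by this toy model.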
2. DataFrames, Datasets, SQL; Spark Streaming
3. Spark MLlib; Spark GraphX
4. Case studies, applications; project discussions
5. Final Project