How to test hypotheses on data

Notes from a talk:


  1. Test the resulting, well-defined hypotheses with a statistical test: a mathematical rule consisting of formulas and an algorithm for applying them.
  2. When applying a statistical test, look at the p-value: the probability of observing data at least as extreme as yours, assuming the null hypothesis is true. If the p-value falls below a chosen significance level (commonly 0.05), reject the null hypothesis.
  3. How do you choose a statistical test?

    Each goal and set of test conditions has its own appropriate statistical test.

  4. One example is Fisher's exact test.
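Fisher's exact test for a 2×2 contingency table can be sketched from first principles with only the standard library; this is an illustrative implementation of the two-sided test via the hypergeometric distribution, not code from the talk:

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    With the margins fixed, the count in the top-left cell follows a
    hypergeometric distribution; the p-value sums the probabilities of
    all tables at least as extreme as the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def hypergeom(k):
        # P(top-left cell == k) given the fixed row/column totals.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = hypergeom(a)
    lo = max(0, row1 - (n - col1))   # smallest feasible top-left count
    hi = min(row1, col1)             # largest feasible top-left count
    # Sum probabilities no greater than the observed one (small epsilon
    # guards against floating-point ties).
    return sum(p for k in range(lo, hi + 1)
               if (p := hypergeom(k)) <= p_obs + 1e-12)

# Fisher's classic "lady tasting tea" table gives the well-known p ≈ 0.486.
p = fisher_exact(3, 1, 1, 3)
print(round(p, 4))  # 0.4857
```

Since p ≈ 0.49 is well above 0.05, the null hypothesis is not rejected here.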



Data Science learning plan

SQL: If you can’t get data, you can’t analyze data. Whether you retrieve data from a SQL database or a Hadoop cluster with a SQL-language layer on top of it, this is where you start. http://sqlschool.modeanalytics.com/ is a great interactive learning interface. O’Reilly’s SQL Cookbook is a masterpiece that traverses all levels of SQL proficiency.
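The bread-and-butter retrieval pattern — filter, aggregate, group — can be tried without any database server using Python's built-in sqlite3 module; the table and data below are invented for illustration:

```python
import sqlite3

# In-memory database so the example is fully self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "west"), (2, 80.0, "east"), (3, 200.0, "west")],
)

# Aggregate total order amount per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 80.0), ('west', 320.0)]
```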
Full-Stack Data Science: Coursera offers a full stack online curriculum on a continuous basis for a reasonable price. This DOES NOT teach you SQL. If you’re in SF or NYC, you can attend General Assembly’s pricier in-person full stack curriculum. This gives you a cursory introduction to data storage, retrieval, prep, light analysis, and deeper predictive and inferential analysis.
Python: Code Academy or Udemy will teach you the basics. Python can play two functions in the skill stack: 1) to conduct ad-hoc statistical analysis as you would with R, 2) to do everything else. Python is important for the "everything else." You might use it to get data from APIs, scrape, write ETL jobs, refresh data in your warehouse, or retrain models. This is the piece of the skill stack that moves you from being a Static Data Scientist (one who works with data in a manual fashion) to a Live Data Scientist (one who has automated many of the processes contributing to data science output, loosely defined).
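The "everything else" role often means small ETL jobs. Here is a hedged, standard-library-only sketch — extract raw CSV, transform it, load it into a warehouse table — with made-up data and schema:

```python
import csv
import io
import sqlite3

# Extract: raw CSV as it might arrive from an upstream export.
raw = "user,signup_date\nalice,2021-01-05\nbob,2021-02-11\n"
records = list(csv.DictReader(io.StringIO(raw)))

# Transform: derive the signup month from the full date.
for r in records:
    r["signup_month"] = r["signup_date"][:7]

# Load: write the transformed records into a (here in-memory) warehouse.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE signups (user TEXT, signup_month TEXT)")
db.executemany("INSERT INTO signups VALUES (:user, :signup_month)", records)

months = [row[0] for row in db.execute("SELECT signup_month FROM signups")]
print(months)  # ['2021-01', '2021-02']
```

A real job would read from an API or file drop and write to an actual warehouse, but the extract/transform/load shape stays the same.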

Basic Statistics: Khan Academy Probability and Statistics.
Linear Algebra and Multivariable Calculus: Go to a local college or Khan Academy to brush up on Multivariable Calculus and Linear Algebra. Their curriculums have largely been the same for the past 5 decades.
MapReduce/Hadoop: Focus on this last. There are so many technologies that enable SQL-like interfacing with Hadoop that knowing how to write a MapReduce job is, for the most part, unnecessary. Building real MapReduce pipelines is a behemoth of a task that might be the work of an early-stage startup Data Scientist, but shouldn’t be if you have a solid BI infrastructure team. This is why companies hire the rockstars we know as backend and data engineers. Side note: if you ever meet one and aren’t sure what their company does, thank them for their service to our country, regardless.
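Even if you never write a Hadoop job, the MapReduce model itself is worth understanding; word count — its "hello world" — can be sketched in a few lines of plain Python (this illustrates the map/reduce idea, not a distributed framework):

```python
from collections import Counter
from functools import reduce

# Each "document" would live on a different worker in a real cluster.
docs = ["big data big pipelines", "data pipelines"]

def mapper(doc):
    # Map phase: emit per-document word counts.
    return Counter(doc.split())

# Reduce phase: merge the partial counts into a global tally.
counts = reduce(lambda a, b: a + b, map(mapper, docs))
print(counts["data"])  # 2
```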
Cleaning: Plan to spend most of your time cleaning and transforming data in these languages and technologies. The analysis is the fast, fun part.
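A small taste of that cleaning work, using only the standard library — the messy input rows and the set of "missing" markers are invented for illustration:

```python
# Raw values as they often arrive: stray whitespace, missing-value
# markers, mixed integer/float formatting.
raw_rows = [" 42 ", "n/a", "17", "", "3.0"]

def clean(value):
    """Normalize a raw string to a float, or None for missing values."""
    value = value.strip().lower()
    if value in ("", "n/a", "na", "null"):
        return None
    return float(value)

cleaned = [clean(v) for v in raw_rows]          # keep row alignment
values = [v for v in cleaned if v is not None]  # drop missing for analysis
print(values)  # [42.0, 17.0, 3.0]
```

Real datasets add date parsing, deduplication, and schema mismatches on top; the pattern of normalize-then-filter stays the same.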